Proceedings
Proceedings of Pervasive 2004 Workshop on Memory and Sharing of Experiences
April 20, 2004, Vienna, Austria
http://www.ii.ist.i.kyoto-u.ac.jp/~sumi/pervasive04/

Organizers
Kenji Mase (Nagoya University / ATR)
Yasuyuki Sumi (Kyoto University / ATR)
Sidney Fels (University of British Columbia)

Program Committee
Kiyoharu Aizawa (The University of Tokyo)
Jeremy Cooperstock (McGill University)
Richard DeVaul (Massachusetts Institute of Technology)
Jim Gemmell (Microsoft)
Yasuyuki Kono (Nara Institute of Science and Technology)
Bernt Schiele (ETH Zurich)
Thad Starner (Georgia Institute of Technology)
Terry Winograd (Stanford University)

Supported by
ATR Media Information Science Laboratories
2-2-2 Hikaridai, Keihanna Science City, Kyoto 619-0288 JAPAN
Phone: +81-774-95-1401
Fax: +81-774-95-1408
http://www.mis.atr.jp/

ISBN: 4-902401-01-0

Preface

Welcome to the Pervasive 2004 workshop on Memory and Sharing of Experience (MSE2004). The purpose of the workshop is to provide an opportunity to exchange research results and to foster ideas in the emerging field of ubiquitous experience recording technologies with the goal of effective experience sharing.

Pervasive computing environments provide essential infrastructure to record experiences of people working and playing in the real world. Underlying the infrastructure for recording experiences into extensive logs are ubiquitous sensor networks and effective tagging systems. This recorded experience becomes a life-memory, and, as a communication medium, is sharable with others to enhance our sense of community. Memory and sharing of experience is emerging as an important application of the pervasive computing era.
Moreover, research on Memory and Sharing of Experience encompasses many exciting areas in both science and technology, such as multimedia memory aids, references for context recognition, life-pattern modeling, and storytelling of life.

The workshop addresses the following topics: methods and devices to capture experience; storage and databases of experience for recollection; experience and interaction corpora; experience log applications; privacy issues; and other related areas. The workshop consists of 17 presentations selected by peer review from the submitted articles. The Program Committee made the selection to fit the limited time and space allotted to the one-day workshop associated with Pervasive 2004. Unfortunately, many interesting submissions could not fit into the program. Each submission was reviewed by two or more PC members.

We thank all the participants who put their time and effort into submitting papers, and the PC members for their reviews. We would also like to thank ATR Media Information Science Laboratories and NICT (National Institute of Information and Communications Technology) for their support in publishing this workshop record in printed form. We look forward to the workshop providing a rich environment for academia and industry to foster active collaboration in the development of ubiquitous media technologies focused on Memory and Sharing of Experience.
MSE2004 Workshop Program Committee Co-chairs
Kenji Mase, Nagoya University / ATR Media Information Science Laboratories
Yasuyuki Sumi, Kyoto University / ATR Media Information Science Laboratories
Sidney Fels, University of British Columbia

Pervasive 2004 Workshop on Memory and Sharing Experience

SCHEDULE
Tuesday, April 20, 2004

9:00-10:35 Session 1: Introduction & Capturing Experiences
  Introduction to Memory and Sharing of Experiences
    Kenji Mase, Yasuyuki Sumi, Sidney Fels
  Collaborative Capturing and Interpretation of Interactions (L)
    Yasuyuki Sumi, Sadanori Ito, Tetsuya Matsuguchi, Sidney Fels, Kenji Mase
  Context Annotation for a Live Life Recording (L)
    Nicky Kern, Bernt Schiele, Holger Junker, Paul Lukowicz, Gerhard Tröster, Albrecht Schmidt
  Capture and Efficient Retrieval of Life Log (L)
    Kiyoharu Aizawa, Tetsuro Hori, Shinya Kawasaki, Takayuki Ishikawa

11:05-12:25 Session 2: Recollecting Personal Memory
  Exploring Graspable Cues for Everyday Recollecting
    Elise van den Hoven
  Remembrance Home: Storage for Re-discovering One's Life (L)
    Yasuyuki Kono, Kaoru Misaki
  An Object-centric Storytelling Framework Using Ubiquitous Sensor Technology (L)
    Norman Lin, Kenji Mase, Yasuyuki Sumi
  Storing and Replaying Experiences in Mixed Environments using Hypermedia
    Nuno Correia, Luis Alves, Jorge Santiago, Luis Romero
  Storing, Indexing and Retrieving My Autobiography
    Alberto Frigo

14:00-15:35 Session 3: Utilizing Experiences
  Sharing Experience and Knowledge with Wearable Computers (L)
    Marcus Nilsson, Mikael Drugge, Peter Parnes
  Sharing Multimedia and Context Information between Mobile Terminals
    Jani Mäntyjärvi, Heikki Keränen, Tapani Rantakokko
  Using an Extended Episodic Memory Within a Mobile Companion (L)
    Alexander Kröner, Stephan Baldes, Anthony Jameson, Mathias Bauer
  u-Photo: A Design and Implementation of a Snapshot Based Method for Capturing Contextual Information (L)
    Takeshi Iwamoto, Genta Suzuki, Shun Aoki, Naohiko Kohtake, Kazunori Takashio, Hideyuki Tokuda
  The Re: living Map - an Effective Experience with GPS Tracking and Photographs
    Yoshimasa Niwa, Takafumi Iwai, Yuichiro Haraguchi, Masa Inakage

16:05-17:30 Session 4: Fundamental and Social Issues & Discussion
  Relational Analysis among Experiences and Real World Objects in the Ubiquitous Memories Environment
    Tatsuyuki Kawamura, Takahiro Ueoka, Yasuyuki Kono, Masatsugu Kidode
  A Framework for Personalizing Action History Viewer
    Masaki Ito, Jin Nakazawa, Hideyuki Tokuda
  Providing Privacy While Being Connected (L)
    Natalia A. Romero, Panos Markopoulos
  Capturing Conversational Participation in a Ubiquitous Sensor Environment
    Yasuhiro Katagiri, Mayumi Bono, Noriko Suzuki

* "L" denotes long presentation

Table of Contents

  Collaborative Capturing and Interpretation of Interactions
    Yasuyuki Sumi, Sadanori Ito, Tetsuya Matsuguchi, Sidney Fels, Kenji Mase
  Context Annotation for a Live Life Recording
    Nicky Kern, Bernt Schiele, Holger Junker, Paul Lukowicz, Gerhard Tröster, Albrecht Schmidt
  Capture and Efficient Retrieval of Life Log
    Kiyoharu Aizawa, Tetsuro Hori, Shinya Kawasaki, Takayuki Ishikawa
  Exploring Graspable Cues for Everyday Recollecting
    Elise van den Hoven
  Remembrance Home: Storage for Re-discovering One's Life
    Yasuyuki Kono, Kaoru Misaki
  An Object-centric Storytelling Framework Using Ubiquitous Sensor Technology
    Norman Lin, Kenji Mase, Yasuyuki Sumi
  Storing and Replaying Experiences in Mixed Environments using Hypermedia
    Nuno Correia, Luis Alves, Jorge Santiago, Luis Romero
  Storing, Indexing and Retrieving My Autobiography
    Alberto Frigo
  Sharing Experience and Knowledge with Wearable Computers
    Marcus Nilsson, Mikael Drugge, Peter Parnes
  Sharing Multimedia and Context Information between Mobile Terminals
    Jani Mäntyjärvi, Heikki Keränen, Tapani Rantakokko
  Using an Extended Episodic Memory Within a Mobile Companion
    Alexander Kröner, Stephan Baldes, Anthony Jameson, Mathias Bauer
  u-Photo: A Design and Implementation of a Snapshot Based Method for Capturing Contextual Information
    Takeshi Iwamoto, Genta Suzuki, Shun Aoki, Naohiko Kohtake, Kazunori Takashio, Hideyuki Tokuda
  The Re: living Map - an Effective Experience with GPS Tracking and Photographs
    Yoshimasa Niwa, Takafumi Iwai, Yuichiro Haraguchi, Masa Inakage
  Relational Analysis among Experiences and Real World Objects in the Ubiquitous Memories Environment
    Tatsuyuki Kawamura, Takahiro Ueoka, Yasuyuki Kono, Masatsugu Kidode
  A Framework for Personalizing Action History Viewer
    Masaki Ito, Jin Nakazawa, Hideyuki Tokuda
  Providing Privacy While Being Connected
    Natalia A. Romero, Panos Markopoulos
  Capturing Conversational Participation in a Ubiquitous Sensor Environment
    Yasuhiro Katagiri, Mayumi Bono, Noriko Suzuki

Collaborative Capturing and Interpretation of Interactions

Yasuyuki Sumi†‡, Sadanori Ito‡, Tetsuya Matsuguchi‡ß, Sidney Fels¶, Kenji Mase§‡
†Graduate School of Informatics, Kyoto University
‡ATR Media Information Science Laboratories
¶The University of British Columbia
§Information Technology Center, Nagoya University
ßPresently with University of California, San Francisco
sumi@i.kyoto-u.ac.jp, http://www.ii.ist.i.kyoto-u.ac.jp/~sumi

ABSTRACT
This paper proposes a notion of interaction corpus, a captured collection of human behaviors and interactions among humans and artifacts. Digital multimedia and ubiquitous sensor technologies create a venue to capture and store interactions that are automatically annotated. A very large-scale accumulated corpus provides an important infrastructure for a future digital society for both humans and computers to understand verbal/nonverbal mechanisms of human interactions. The interaction corpus can also be used as a well-structured stored experience, which is shared with other people for communication and creation of further experiences. Our approach employs wearable and ubiquitous sensors, such as video cameras, microphones, and tracking tags, to capture all of the events from multiple viewpoints simultaneously. We demonstrate an application of generating a video-based experience summary that is reconfigured automatically from the interaction corpus.

KEYWORDS: interaction corpus, experience capturing, ubiquitous sensors

INTRODUCTION
Weiser proposed a vision where computers pervade our environment and hide themselves behind their tasks [1].
To achieve this vision, we need a new HCI (Human-Computer Interaction) paradigm based on embodied interactions, beyond existing HCI frameworks based on the desktop metaphor and GUIs (Graphical User Interfaces). A machine-readable dictionary of interaction protocols among humans, artifacts, and environments is necessary as an infrastructure for the new paradigm.

As a first step, this paper proposes to build an interaction corpus, a semi-structured set of a large amount of interaction data collected by various sensors. We aim to use this corpus as a medium to share past experiences with others. Since the captured data is segmented into primitive behaviors and annotated semantically, it is easy to collect the action highlights, for example, to generate a reconstructed diary. The corpus can, of course, also serve as an infrastructure for researchers to analyze and model social protocols of human interactions.

Our approach for the interaction corpus is characterized by the integration of many sensors (video cameras and microphones), ubiquitously set up around rooms and outdoors, and wearable sensors (video camera, microphone, and physiological sensors) to monitor humans as the subjects of interactions¹. More importantly, our system incorporates ID tags with an infrared LED (LED tags) and an infrared signal tracking device (IR tracker) in order to record positional context along with audio/video data. The IR tracker gives the position and identity of any tag attached to an artifact or human in its field of view. By wearing an IR tracker, a user's gaze can also be determined. This approach assumes that gazing can be used as a good index for human interactions [2]. We also employ autonomous physical agents, like humanoid robots [3], as social actors to proactively collect human interaction patterns by intentionally approaching humans.
Use of the corpus allows us to relate the captured event to interaction semantics among users by collaboratively processing the data of users who jointly interact with each other in a particular setting. This can be performed without time-consuming audio and image processing as long as the corpus is well prepared with fine-grained annotations. Using the interpreted semantics, we also provide an automated video summarization of individual users' interactions to show the accessibility of our interaction corpus. The resulting video summary itself is also an interaction medium for experience-sharing communication.

¹ Throughout this paper, we use the term "ubiquitous" to describe sensors set up around the room and "wearable" to specify sensors carried by the users.

CAPTURING INTERACTIONS BY MULTIPLE SENSORS
We developed a prototype system for recording natural interactions among multiple presenters and visitors in an exhibition room. The prototype was installed and tested in one of the exhibition rooms during our research laboratories' two-day open house.

Figure 1 illustrates the system architecture for collecting interaction data. The system consists of sensor clients ubiquitously set up around the room and wearable clients to monitor humans as subjects of interactions. Each client has a video camera, microphone, and IR tracker, and sends the data to the central data server. Some wearable clients have physiological sensors.

Figure 1: Architecture of the system for capturing interactions. (Diagram labels: wireless connection; Ethernet connection; wearable sensors; stationary sensors; portable capturing PCs 1 to m; stationary capturing PCs 1 to n; IR tracker; head-mounted camera; headset microphone; physiological sensors; stationary camera; stationary microphone; tactile sensors; ultrasonic sensors; omni-directional camera; stereo cameras; humanoid robot; communication robot; captured data server with raw AV data and SQL DB; application server.)

RELATED WORKS
There have been many works on smart environments for supporting humans in a room by using video cameras set around the room, e.g., the Smart rooms [4], Intelligent room [5], AwareHome [6], Kidsroom [7], and EasyLiving [8]. The shared goal of these works was recognition of human behavior using computer vision techniques and understanding of the human's intention. On the other hand, our interest is to capture not only an individual human's behavior but also interactions among multiple humans (networking of their behaviors). We therefore focus on the understanding and utilization of human interactions by employing an infrared ID system to simply identify a human's presence.

There also have been works on wearable systems for collecting personal daily activities by recording video data, e.g., [9] and [10]. Their aim was to build an intelligent recording system used by single users. We, however, aim to build a system collaboratively used by multiple users to capture their shared experiences and promote their further creative collaborations. By using such a system, our experiences can be recorded from multiple viewpoints and individual viewpoints will become obvious.

This paper shows a system that automatically generates video summaries for individual users as an application of our interaction corpus. In relation to this system, some systems to extract important scenes of a meeting from its video data have been proposed, e.g., [11]. These systems extract scenes according to changes in the physical quantity of video data captured by fixed cameras. On the other hand, our interest is not to detect changes of visual quantity but to segment human interactions (perhaps derived from the humans' intentions and interests), and then extract scene highlights from a meeting naturally.

IMPLEMENTATION
Figure 2 is a snapshot of the exhibition room set up for recording an interaction corpus. There were five booths in the exhibition room. Each booth had two sets of ubiquitous sensors that include video cameras with IR trackers and microphones. LED tags were attached to possible focal points for social interactions, such as on posters and displays.

The principal data is video data sensed by camera and microphone. Along with the video stream data, IDs of the LED tags captured by the IR trackers and physiological data are recorded in the database as indices of the video data.

Each presenter at their booth carried a set of wearable sensors, including a video camera with an IR tracker, a microphone, an LED tag, and physiological sensors (heart rate, skin conductance, and temperature). A visitor could choose to carry the same wearable system as the presenters, just an LED tag, or nothing at all.

The humanoid robots in the room record their own behavior logs and the reactions of the humans with whom the robots interact. One booth had a humanoid robot for its demonstration that was also used as an actor to interact with visitors and record interactions using the same wearable system as the human presenters.

Figure 2: Setup of the ubiquitous sensor room. (Photo labels: ubiquitous sensors (video camera, microphone, IR tracker); LED tags attached to objects; video camera, IR tracker, LED tag; microphone; PC; humanoid robot.)

The clients for recording the sensed data were Windows-based PCs. In order to incorporate data from multiple sensor sets, time is an important index. We installed NTP (Network Time Protocol) on all the client PCs to synchronize their internal clocks to within 10 ms.
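Because time serves as the common index across sensor sets, records from different clients can be interleaved by timestamp once the clocks agree. The following is a minimal sketch in Python, not the system's actual code: it assumes each client delivers records already sorted by time, the `Record` layout and client names are hypothetical, and the 10 ms figure is used only as a tolerance for treating two records as simultaneous.

```python
import heapq
from typing import Iterable, Iterator, NamedTuple


class Record(NamedTuple):
    time: float    # seconds, from the client's NTP-synchronized clock
    client: str    # hypothetical client name
    payload: str   # hypothetical content, e.g. a detected tag ID


CLOCK_TOLERANCE = 0.010  # 10 ms, the synchronization bound reported in the text


def merge_streams(*streams: Iterable[Record]) -> Iterator[Record]:
    """Interleave per-client streams (each sorted by time) into one time-ordered stream."""
    return heapq.merge(*streams, key=lambda r: r.time)


def simultaneous(a: Record, b: Record) -> bool:
    """Treat two records as simultaneous if they differ by no more than the tolerance."""
    return abs(a.time - b.time) <= CLOCK_TOLERANCE


wearable = [Record(100.000, "wearable-1", "tag 60"), Record(100.500, "wearable-1", "tag 4")]
stationary = [Record(100.004, "booth-A", "tag 12"), Record(100.300, "booth-A", "tag 12")]
merged = list(merge_streams(wearable, stationary))
```

Merging lazily with `heapq.merge` avoids loading every stream into memory, which matters little here but would for long recordings.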
Recorded video data were gathered on a UNIX file server via a Samba server. Index data given to the video data were stored in an SQL server (MySQL) running on another Linux machine. In addition, we had another Linux-based server, called an application server, for generating a video-based summary by using MJPEG Tools².

At each client PC, video data was encoded into MJPEG (320 x 240 resolution, 15 frames per second) and audio data was recorded in PCM 22 kHz 16-bit monaural.

Figure 3 shows the prototyped IR tracker and LED tag. The IR tracker consists of a CMOS camera for detecting blinking signals of the LED and a microcomputer for controlling the CMOS camera. The IR tracker was embedded in a small box with another CCD camera for recording video contents.

Figure 3: IR tracker and LED tag. (Photo labels: LED tag; microcomputer; LED; IR tracker; CMOS camera for ID tracking; CCD camera for video recording.)

Each LED tag emits a 6-bit unique ID, allowing for 64 different IDs, by rapidly flashing. The IR trackers recognize IDs of LED tags within their view in the range of 2.5 meters, and send the detected IDs to the SQL server. Each tracker datum consists of spatial data (the two-dimensional coordinate of the tag detected by the IR tracker) and temporal data (the time of detection), in addition to the ID of the detected tag (see Figure 4).

ID   TIME                X     Y
4    1036571603.137000   61    229
60   1036571603.448000   150   29
4    1036571603.878000   61    228
60   1036571604.319000   149   28
4    1036571604.659000   62    227
60   1036571605.440000   152   31
60   1036571605.791000   150   28
60   1036571606.131000   148   30
4    1036571606.472000   64    230
60   1036571607.163000   150   30
60   1036571608.074000   150   30
60   1036571608.385000   148   29
60   1036571608.725000   146   28
4    1036571609.066000   65    228

Figure 4: Indexing by visual tags. (The accompanying plot shows detections of IDs 60 and 4 along the TIME axis.)

A few persons attached three types of physiological sensors (a pulse physiology sensor, a skin conductance sensor, and a temperature sensor) to their fingers³. These data were also sent to the SQL server via the PC.

Eighty users participated during the two-day open house, providing about 300 hours of video data and 380,000 tracker data along with associated physiological data. The major advantage of the system is the relatively short time required for analyzing tracker data compared to processing the audio and images of all the video data.

² A set of tools that can do cut-and-paste editing and MPEG compression of audio and video under Linux. http://mjpeg.sourceforge.net
³ We used Procomp+ as an AD converter for transmitting sensed signals to the carried PC.

INTERPRETING INTERACTIONS
To illustrate how our interaction corpus may be used, we constructed a system to provide users with a personal summary video, generated on the fly, at the end of their tour of the exhibition room. We developed a method to segment interaction scenes from the IR tracker data. We defined interaction primitives, or "events", as significant intervals or moments of activities. For example, a video clip that has a particular object (such as a poster, user, etc.) in it constitutes an event. Since the location of all objects is known from the IR tracker and LED tags, it is easy to determine these events. We then interpret the meaning of events by considering the combination of objects appearing in the events.

Figure 5 illustrates basic events that we considered.

stay: A fixed IR tracker at a booth captures an LED tag attached to a user: the user stays at the booth.
coexist: A single IR tracker captures LED tags attached to different users at some moment: the users coexist in the same area.
gaze: An IR tracker worn by a user captures an LED tag attached to someone/something: the user gazes at someone/something.
attention: An LED tag attached to an object is simultaneously captured by IR trackers worn by two users: the users jointly pay attention to the object. When many users pay attention to the object, we infer that the object plays a socially important role at that moment.
facing: Two users' IR trackers detect each other's LED tags: they are facing each other.

Figure 5: Interaction primitives. (Diagram labels: Staying; Coexistence; Gazing at an object; Joint attention; Attention focus: socially important event; Conversation; IR tracker's view; LED tag.)

Raw data from IR trackers are just a set of intermittently detected IDs of LED tags. Therefore, we first group the discrete data into interval data implying that a certain LED tag stays in view for a period of time. Then, these interval data are interpreted as one of the above events according to the combination of entities to which the IR tracker and LED tag are attached.

In order to group the discrete data into interval data, we assigned two parameters, minInterval and maxInterval. A captured event is at least minInterval in length, and times between tracker data that make up the event are less than maxInterval. The minInterval allows elimination of events too short to be significant. The maxInterval value compensates for the low detection rate of the tracker; however, if the maxInterval is too large, more erroneous data will be utilized to make captured events. The larger the minInterval and the smaller the maxInterval are, the fewer the significant events that will be recognized.

For the first prototype, we set both the minInterval and maxInterval at 5 sec. However, a 5 sec maxInterval was too short to extract events having a meaningful length of time. As a result of the video analyses, we found an appropriate value of maxInterval: 10 sec for ubiquitous sensors and 20 sec for wearable sensors.
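The grouping step and the simplest event interpretations can be sketched as follows. This is a minimal illustration in Python under stated assumptions, not the authors' implementation: the function names, the `Interval` type, and the "booth" / "user" / "object" owner labels are hypothetical. Only the two events that can be decided from a single tracker/tag pairing (stay and gaze) are labeled; coexist, attention, and facing require comparing intervals across several tags or trackers and are omitted.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Interval:
    tag_id: int
    start: float
    end: float


def group_detections(tag_id: int, times: Iterable[float],
                     min_interval: float, max_interval: float) -> List[Interval]:
    """Group one tag's detection times (from one tracker) into intervals.

    Consecutive detections closer than max_interval join the same interval;
    intervals shorter than min_interval are discarded.
    """
    intervals: List[Interval] = []
    start: Optional[float] = None
    prev: Optional[float] = None
    for t in sorted(times):
        if start is None or prev is None:
            start = prev = t
        elif t - prev < max_interval:
            prev = t                      # still the same interval
        else:
            if prev - start >= min_interval:
                intervals.append(Interval(tag_id, start, prev))
            start = prev = t              # gap too long: open a new interval
    if start is not None and prev is not None and prev - start >= min_interval:
        intervals.append(Interval(tag_id, start, prev))
    return intervals


def classify(tracker_on: str, tag_on: str) -> Optional[str]:
    """Label an interval from what the tracker and the tag are attached to."""
    if tracker_on == "booth" and tag_on == "user":
        return "stay"
    if tracker_on == "user":
        return "gaze"                     # the wearer looks at someone/something
    return None


# Hypothetical detection times in seconds for tag 4.
detections = [0.0, 2.0, 4.0, 6.0, 30.0, 31.0]
events = group_detections(4, detections, min_interval=5.0, max_interval=10.0)
```

With these sample values, the first four detections form one 6-second interval, while the last two are separated from it by a 24-second gap and form an interval too short to keep.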
The difference in maxInterval values is reasonable because ubiquitous sensors are fixed and wearable sensors are moving.

VIDEO SUMMARY
We were able to extract appropriate "scenes" from the viewpoints of individual users by clustering events having spatial and temporal relationships.

Figure 6: Interpreting events to scenes by grouping spatio-temporal co-occurrences. (Diagram: along a time axis, the events Talk to A, Talk to B, Talk to C, Visit X, Visit Y, Visit Z, and Look into W are grouped into the scenes "Talk to A about Z", "Talk to B & C about Y", and "Watch W at X".)

A scene is made up of several basic interaction events and is defined based on time. Because of the setup of the exhibition room, in which five separate booths had a high concentration of sensors, scenes were location-dependent to some extent as well. Precisely, all the events that overlap by at least minInterval / 2 were considered to be part of the same scene (see Figure 6).

Scene videos were created in a linear time fashion using only one source of video at a time. In order to decide which video source to use to make up the scene video, we established a priority list. In creating the priority list, we made a few assumptions. One of these assumptions was that the video source of a user associated with a captured event of UserA shows the close-up view of UserA. Another assumption was that all the components of the interactions occurring in BoothA are captured by the ubiquitous cameras set up for BoothA.

The actual priority list used was based on the following basic rules. When someone is speaking (the volume of the audio is greater than 0.1 out of 1.0), a video source that shows the close-up view of the speaker is used. If no one involved in the event is speaking, the ubiquitous video camera source is used.

Figure 7 shows an example of video summarization for a user. The summary page was created by chronologically listing scene videos, which were automatically extracted based on events (see above). We used thumbnails of the scene videos and coordinated their shading based on the videos' duration for quick visual cues. The system provided each scene with annotations, i.e., time, description, and duration. The descriptions were automatically determined according to the interpretation of extracted interactions by using templates, as follows.

TALKED WITH: I talked with [someone].
WAS WITH: I was with [someone].
LOOKED AT: I looked at [something].

In the time intervals where more than one interaction event has occurred, the following priority was used: TALKED WITH > WAS WITH > LOOKED AT.

We also provided a summary video for a quick overview of the events the users experienced. To generate the summary video, we used a simple format in which at most 15 seconds of each relevant scene was put together chronologically with fading effects between the scenes.

The event clips used to make up a scene were not restricted to those captured by a single resource (video camera and microphone). For example, for a summary of a conversation (TALKED WITH) scene, the video clips used were recorded by the camera worn by the user him/herself, the camera of the conversation partner, and a fixed camera on the ceiling that captured both users. Our system selects which video clips to use by consulting the volume levels of the users' individual voices. The worn LED tag is assumed to indicate that the user's face is in the video clip if the associated IR tracker detects it. Thus, interchanging video and audio from different worn sensors can generate a scene that combines the speaker's face from one camera with a clearer voice from his/her own microphone.

Figure 7: Automated video summarization. (Screenshot labels: summary video of the user's entire visit; list of highlighted scenes during the user's visit; annotations for each scene: time, description, duration; video example of a conversation scene from the overhead camera, the partner's camera, and the user's own camera.)

CORPUS VIEWER: TOOL FOR ANALYZING INTERACTION PATTERNS
The video summarizing system was intended to be used as an end-user application. Our interaction corpus is also valuable for researchers to analyze and model human social interactions. In such a context, we aim to develop a system that researchers (HCI designers, social scientists, etc.) can query for specific interactions quickly with simple commands that provide enough flexibility to suit various needs. To this end, we prototyped a system called the Corpus Viewer, as shown in Figure 8.

This system first visualizes all interactions collected from the viewpoint of a certain user. The vertical axis is time. Vertical bars correspond to IR trackers (red bars) that capture the selected user's LED tag and LED tags (blue bars) that are captured by the user's IR tracker. The many horizontal lines on the bars represent IR tracker data. By viewing this, we can easily grasp an overview of the user's interactions with other users and exhibits, such as mutual gazing with other users and staying at a certain booth. The viewer's user can then select any part of the bars to extract a video corresponding to the selected time and viewpoint.

We have just started to work together with social scientists to identify patterns of social interactions in the exhibition room using our interaction corpus augmented by the Corpus Viewer. The social scientists actually used our system to roughly estimate sufficient points from a large amount of data by browsing clusters of IR tracking data.

CONCLUSIONS
This paper proposed a method to build an interaction corpus using multiple sensors either worn or placed ubiquitously in the environment. We built a method to segment and interpret interactions from a huge amount of collected data in a bottom-up manner by using IR tracking data. At the two-day demonstration of our system, we were able to provide users with a video summary at the end of their experience on the fly.
We also developed a prototype system to help social scientists analyze our interaction corpus to learn social protocols from the interaction patterns.

Figure 8: Corpus viewer for facilitating an analysis of interaction patterns. [Panel labels: 1. Selecting a period of time to extract video; 2. Confirmation and adjustment of the selected target; 3. Viewing the extracted video]

ACKNOWLEDGEMENTS
We thank our colleagues at ATR for their valuable discussion and help on the experiments described in this paper. Valuable contributions to the systems described in this paper were made by Tetsushi Yamamoto, Shoichiro Iwasawa, and Atsushi Nakahara. We would also like to thank Norihiro Hagita, Yasuhiro Katagiri, and Kiyoshi Kogure for their continuing support of our research. This research was supported in part by the Telecommunications Advancement Organization of Japan.

REFERENCES
1. Mark Weiser. The computer for the 21st century. Scientific American, 265(3):94–104, 1991.
2. Rainer Stiefelhagen, Jie Yang, and Alex Waibel. Modeling focus of attention for meeting indexing. In ACM Multimedia ’99, pages 3–10. ACM, 1999.
3. Takayuki Kanda, Hiroshi Ishiguro, Michita Imai, Tetsuo Ono, and Kenji Mase. A constructive approach for developing interactive humanoid robots. In 2002 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2002), pages 1265–1270, 2002.
4. Alex Pentland. Smart rooms. Scientific American, 274(4):68–76, 1996.
5. Rodney A. Brooks, Michael Coen, Darren Dang, Jeremy De Bonet, Joshua Kramer, Tomás Lozano-Pérez, John Mellor, Polly Pook, Chris Stauffer, Lynn Stein, Mark Torrance, and Michael Wessler. The intelligent room project. In Proceedings of the Second International Cognitive Technology Conference (CT’97), pages 271–278. IEEE, 1997.
6. Cory D. Kidd, Robert Orr, Gregory D. Abowd, Christopher G. Atkeson, Irfan A. Essa, Blair MacIntyre, Elizabeth Mynatt, Thad E. Starner, and Wendy Newstetter. The aware home: A living laboratory for ubiquitous computing research. In Proceedings of CoBuild’99 (Springer LNCS1670), pages 190–197, 1999.
7. Aaron F. Bobick, Stephen S. Intille, James W. Davis, Freedom Baird, Claudio S. Pinhanez, Lee W. Campbell, Yuri A. Ivanov, Arjan Schütte, and Andrew Wilson. The KidsRoom: A perceptually-based interactive and immersive story environment. Presence, 8(4):369–393, 1999.
8. Barry Brumitt, Brian Meyers, John Krumm, Amanda Kern, and Steven Shafer. EasyLiving: Technologies for intelligent environments. In Proceedings of HUC 2000 (Springer LNCS1927), pages 12–29, 2000.
9. Steve Mann. Humanistic intelligence: WearComp as a new framework for intelligence signal processing. Proceedings of the IEEE, 86(11):2123–2125, 1998.
10. Tatsuyuki Kawamura, Yasuyuki Kono, and Masatsugu Kidode. Wearable interfaces for a video diary: Towards memory retrieval, exchange, and transportation. In The 6th International Symposium on Wearable Computers (ISWC2002), pages 31–38. IEEE, 2002.
11. Patrick Chiu, Ashutosh Kapuskar, Sarah Reitmeier, and Lynn Wilcox. Meeting capture in a media enriched conference room. In Proceedings of CoBuild’99 (Springer LNCS1670), pages 79–88, 1999.

Context Annotation for a Live Life Recording

Nicky Kern, Bernt Schiele
Perceptual Computing and Computer Vision, ETH Zurich, Switzerland
{kern,schiele}@inf.ethz.ch

Holger Junker, Paul Lukowicz, Gerhard Tröster
Wearable Computing Lab, ETH Zurich, Switzerland
{junker,lukowicz,troester}@ife.ee.ethz.ch

Albrecht Schmidt
Media Informatics Group, Universität München
albrecht.schmidt@acm.org

ABSTRACT
We propose to use wearable sensors and computer systems to generate personal contextual annotations in audio-visual recordings of a person’s life. In this paper we argue that such annotations are essential and effective to allow retrieval of relevant information from large audio-visual databases. The paper summarizes work on automatically annotating meeting recordings, extracting context from body-worn acceleration sensors alone, and combining context from three different sensors (acceleration, audio, location) for estimating the interruptability of the user. These first experimental results indicate that it is possible to automatically find useful annotations for a lifetime’s recording. We also discuss what can be achieved with certain sensors and sensor configurations.

INTRODUCTION
Interestingly, about 500 terabytes of storage are sufficient to record all audio-visual information a person perceives during an entire lifespan (see footnote 1). This amount of storage will be available even to an average person in the not so distant future. A wearable recording and computing device might therefore be used to ’remember’ any talk, any discussion, or any environment the person saw.

Footnote 1: Assuming a lifespan of 100 years, 24 h of recording per day, and 10 MB per minute of recording results in approximately 500 TB.

For annotating an entire lifetime it is important that the recording device with the attached sensors can be worn by the user in any situation. Although it is possible to augment certain environments, this will not be sufficient. Furthermore, wearable computers allow a truly personal audio-visual record of a person’s surroundings in any environment. A hat- or glasses-mounted camera and microphones attached to the chest or shoulders of the person enable recording from a first-person perspective.

Today, however, the usefulness of such data is limited by the lack of adequate methods for accessing and indexing large audio-visual databases. While humans tend to remember events by associating them with personal experience and contextual information, today’s archiving systems are based solely on date, time, location and simple content classification.
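The storage estimate in the introduction's footnote can be checked with a few lines of arithmetic. The sketch below uses only the footnote's assumptions (100 years, 24 h per day, 10 MB per minute); ignoring leap days and using decimal units (1 TB = 10^6 MB) are our own simplifications, and the function name is hypothetical.

```python
# Assumptions from the footnote: 100-year lifespan, continuous recording,
# 10 MB of audio-visual data per minute.
MB_PER_MINUTE = 10
MINUTES_PER_DAY = 24 * 60
DAYS_PER_YEAR = 365        # leap days ignored (our simplification)
MB_PER_TB = 1_000_000      # decimal units (our simplification)


def lifetime_storage_tb(years=100, mb_per_minute=MB_PER_MINUTE):
    """Return the storage needed for `years` of continuous recording, in TB."""
    total_mb = years * DAYS_PER_YEAR * MINUTES_PER_DAY * mb_per_minute
    return total_mb / MB_PER_TB


if __name__ == "__main__":
    print(f"{lifetime_storage_tb():.1f} TB")
```

Under these assumptions the total is about 526 TB, which agrees with the rounded figure of 500 TB.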
As a consequence, even in a recording of a simple event sequence such as a short meeting, it is very difficult for the user to efficiently retrieve relevant events. For example, the user might remember a particular part of the discussion as being a heated exchange conducted during a short, unscheduled coffee break. However, he is unlikely to remember the exact time of this discussion, which is what is typically required today for retrieval from audio-visual recordings.

In this paper we propose to use wearable sensors in order to enhance the recorded data with contextual, personal information to facilitate user-friendly retrieval. Sensors, such as accelerometers and biometric sensors, can enhance the recording with information on the user’s context, activity and physical state. That sensor information can be used to annotate and structure the data stream for later associative access.

This paper summarizes three papers [1, 2, 3] in which we have worked towards extracting such context and specifically using it for retrieving information. The third section of this paper summarizes [1], in which context annotations from audio and acceleration are used to annotate meeting recordings. The fourth section introduces [2], in which context from audio, acceleration and location is used to mediate notifications to the user. The fifth section examines in detail how much information can be extracted from acceleration sensors alone. A discussion of these three in the context of life recording concludes the paper.

RELATED WORK
Recently, the idea of recording an entire lifetime of information has received great attention. The UK Computing Research Committee formulated, as part of the Grand Challenges Initiative, a number of issues arising from recording a lifetime [4].
Microsoft’s MyLifeBits project [5] tries to collect and store any digital information about a person, but leaves the annotation to the user. Finally, DARPA’s LifeLog initiative [6] invites researchers to investigate the issues of data collection, automatic annotation, and retrieval.

Figure 1: Retrieval Application for a Meeting, presentation and discussion parts highlighted (Discussion: sitting, two speakers; Presentation: standing, one speaker). Graphs from top to bottom: audio signal, speaker recognition first speaker, speaker recognition second speaker, Me vs. the World, activity recognition.

The idea of computer-based support for human memory and retrieval is not new. Lamming and Flynn, for example, point out the importance of context as a retrieval key [7] but only used cues like location, phone calls, and interaction between different PDAs. The conference assistant [8] supports the organization of a conference visit, annotation of talks and discussions, and retrieval of information after the visit. Again, the cooperation and communication between different wearables and the environment is an essential part of the system. Rhodes proposed the text-based remembrance agent [9] to help people to retrieve notes they previously made on their computer.

For speech recognition, the automatic transcription of meetings is an extremely challenging task due to overlapping and spontaneous speech, large vocabularies, and difficult background noise [10, 11]. Often, multiple microphones are used, such as close-talking microphones, table microphones, and microphone arrays. The SpeechCorder project [12], for example, aims to retrieve information from roughly transcribed speech recorded during a meeting. Summarization is another topic currently under investigation in speech recognition [13] as well as video processing. We strongly believe, however, that summarization is not enough to allow effective and in particular associative access to the recorded data.
Richter and Le [14] propose a device which will use predefined commands to record conversations and take low-resolution photos. At the University of Tokyo, researchers [15] investigate the possibilities of recording subjective experience by recording audio and video as well as heartbeat or skin conductance, so as to recall one’s experience from various aspects. StartleCam [16] is a wearable device which tries to mimic the wearer’s selective memory. The WearCam idea of Mann [17] is also related to the idea of constantly recording one’s visual environment.

WEARABLE SENSING TO ANNOTATE MEETING RECORDINGS
In order to give first experimental evidence that context annotations are useful, we recorded meetings and annotated them using audio and acceleration sensors [1]. In particular, we extracted information such as walking, standing, and sitting from the acceleration sensors, and speaker changes from the audio. Thus we facilitate the associative retrieval of the information in the meetings.

Looking at the meeting scenario, we have identified four classes of relevant annotations: different meeting phases, flow of discussion, user activity and reactions, and interactions between the participants. The meeting phase includes the time of presentations, breaks, and when somebody is coming or leaving during the meeting. The flow of discussion annotations attach speaker identity and changes to the audio stream, and indicate the level of intensity of discussion. They can also help to differentiate single-person presentations, interactive questions and answers, and heated debate. User activity and reactions indicate the user’s level of interest, focus of attention, and agreement or disagreement with particular issues and comments. By tracking the interaction of the user with other participants, personal discussions can be differentiated from general discussions.

To detect different speakers and find speaker changes, we have implemented an HMM-based speaker segmentation algorithm, based on [18]. A model is trained for every speaker using labelled training data; the final segmentation is done by combining all models. First results yielded a recognition rate of some 85-95%. We proposed a scheme to facilitate retrieval using these segmentation rates by trading error rate against time accuracy of the segmentation.

Audio Context. By clipping the recording microphone to the collar of the user, we can tell the user from the rest of the world by thresholding the energy of the audio signal. This allows us to further increase the recognition rate of the speaker segmentation.

Acceleration Context. We use two 3D-accelerometers to detect the user’s activity. The sensors are attached above the right knee and on the right wrist of the user. We classify the user’s activity into sitting, walking, standing, and shaking hands. The first three tell a lot about the user’s activity, while the last one allows us to find interaction with others. First experiments with our HMM-based classifier yielded a recognition score of some 88-98%.

To show how to use our proposed annotations, we have recorded a set of three short meetings (3-4 min) and evaluated both the speaker identification and the activity recognition on them. The recognition results for the respective sensors are similar to those obtained in our previous experiments.

Figure 1 shows a screen shot of our retrieval application. The audio stream is displayed on top, followed by the results of the speaker identification algorithm (the ground truth is shown in blue). The “Me Vs. The World” row shows the result of the energy thresholding algorithm. Finally, the bottom-most block shows the activity of the user. We can clearly tell the presentation phase of the meeting from the discussion phase by looking at both the number of speaker changes and the fact that the presenter is standing during the presentation.

CONTEXT–AWARE NOTIFICATION FOR WEARABLE COMPUTING
For the automatic mediation of notifications, we have investigated the inference of complex context information (namely the personal and social interruptability of the user) from simpler contexts, such as user activity, social situation, and location [2].

The cost of a notification mainly depends on the interruptability of the user. However, we have to distinguish between the interruptability of the user and that of his environment. We refer to the interruptability of the user as the Personal Interruptability. With the term Social Interruptability we indicate the interruptability of the environment of the user. These two interruptabilities are depicted in a two-dimensional space (see Figure 2). Considering the ‘lecture’ situation in Figure 2, the user is hardly interruptible, because he follows the lecture, and his environment is equally hard to interrupt. However, if the lecture were boring, the user would probably appreciate an interruption, while the environment should still not be interrupted. As shown in Figure 3, this space can also be used to select notification modalities, by discretizing it and assigning a notification modality to every bin.

Figure 2: Personal and Social Interruptability of the User. [Axes: Personal Interruptability and Social Interruptability, each ranging from "don't disturb" through "interruption ok" to "interruption no problem". Example situations plotted: Boring Talk, Bar, Sitting in a Tram, Skiing, Having a coffee, Walking in the street, Restaurant, Riding a bike, Lecture, Driving a car, Waiting Room.]

Figure 3: Selecting Notification Modalities using the User’s Social and Personal Interruptability. [Axes: Intensity for the User and Intensity for the Environment, each ranging from "don't notify" through "make aware" to "grab entire attention". Modalities shown: HMD + Vibration, Vibration + Watch, Beep + HMD, Beep, Speech + HMD, Ring.]

We use three different sensors, namely acceleration, audio, and location, from which we extract low-level context information. We use a single dual-axis accelerometer mounted above the user’s knee, and classify its data into walking, sitting, standing, and stairs. The audio data is classified into street, restaurant, conversation, lecture, and other. Finally, we use the closest wireless LAN access point as location information. We have grouped the available access points into Office, Lab, Lecture Hall, Cafeteria, and Outside.

We found that modelling situations such as the ones in Figure 2 is inappropriate for estimating the user’s interruptability from sensor data. Situations are too general, and thus their corresponding interruptability covers too large an area in the space. Increasing the level of detail of the situations would help, but would make the number of situations unmanageable.

Instead we infer the interruptability directly from low-level sensors. For each context, we define a tendency, which describes where in the interruptability space the interruptability is likely to lie. See Figure 4 for the tendencies we used in our experiments. The final interruptability is then found by weighting the tendencies with the respective sensor recognition score, summing all tendencies together and finding the maximum within the interruptability space.

Figure 4: Tendencies for Combining Low-Level Contexts into the Interruptability of the User. [One panel per low-level context, each with both axes running from 0 to 3. Panels: Acc: Sitting, Acc: Standing, Acc: Walking, Acc: Stairs; Audio: Conversation, Audio: Restaurant, Audio: Street, Audio: Lecture, Audio: Other; Location: Office, Location: Lab, Location: Lecture Hall, Location: Cafeteria, Location: Outdoor.]

We have experimentally shown the feasibility of the approach on a 37-minute stretch of data for all three modalities. We used the tendencies depicted in Figure 4.
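The combination rule described above (weight each context's tendency by its recognition score, sum, and take the maximum) can be sketched as follows. This is a minimal illustration, not the implementation used in [2]: the 3x3 grid, the context names, and all tendency values are made up for the example and are not the values of Figure 4.

```python
# Hypothetical tendency maps on a coarse 3x3 grid.
# Rows index Personal Interruptability, columns index Social Interruptability;
# 0 = "don't disturb", 1 = "interruption ok", 2 = "interruption no problem".
# All numbers are illustrative assumptions, not the tendencies of Figure 4.
TENDENCIES = {
    "acc:sitting":      [[0.3, 0.3, 0.3], [0.3, 0.3, 0.3], [0.3, 0.3, 0.3]],
    "acc:walking":      [[0.0, 0.1, 0.3], [0.1, 0.4, 0.6], [0.3, 0.6, 0.8]],
    "audio:lecture":    [[1.0, 0.5, 0.0], [0.5, 0.2, 0.0], [0.0, 0.0, 0.0]],
    "audio:street":     [[0.0, 0.0, 0.2], [0.0, 0.3, 0.6], [0.2, 0.6, 1.0]],
    "loc:lecture_hall": [[0.8, 0.4, 0.0], [0.6, 0.3, 0.0], [0.2, 0.1, 0.0]],
}


def estimate_interruptability(scores):
    """Combine tendencies of the recognized contexts.

    `scores` maps a context name to its recognition score in [0, 1].
    Returns the (personal, social) cell with the largest weighted sum,
    together with the full summed map.
    """
    size = 3
    total = [[0.0] * size for _ in range(size)]
    for context, score in scores.items():
        tendency = TENDENCIES[context]
        for p in range(size):
            for s in range(size):
                # Weight each tendency by how confident the classifier was.
                total[p][s] += score * tendency[p][s]
    cells = [(p, s) for p in range(size) for s in range(size)]
    best = max(cells, key=lambda cell: total[cell[0]][cell[1]])
    return best, total


if __name__ == "__main__":
    lecture = {"acc:sitting": 0.9, "audio:lecture": 0.8, "loc:lecture_hall": 1.0}
    best, _ = estimate_interruptability(lecture)
    print("lecture ->", best)
```

With these made-up values, a confidently recognized lecture lands in the "don't disturb" cell on both axes, which matches the position of the 'lecture' situation in Figure 2. Working on a grid that coincides with the notification bins is a simplification; a finer grid would be needed to resolve positions within a bin.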
Since we wanted to use the interruptability for notification modality selection, we consider the error sufficiently small if the estimated interruptability falls within the correct ‘bin’ of the 3x3 grid. Using this error measure, we could estimate the Personal Interruptability sufficiently well 96.3% of the time, and the Social Interruptability 88.5% of the time. The Social Interruptability mainly depends on the audio classification, which, in itself, had a lower recognition score than the acceleration classification.

MULTI–SENSOR ACTIVITY CONTEXT DETECTION FOR WEARABLE COMPUTING
We have started to investigate how much information can be extracted from acceleration sensors alone [3]. In particular we investigated the number of sensors required for detecting a certain context and their best placement.

To this end we have developed a hardware platform that allows acceleration readings to be taken from 12 positions on the user’s body. We investigated both simple activities such as sitting, walking, standing, or walking up and down stairs, and more complex ones, such as shaking hands, typing on a keyboard, and writing on a white board. The sensors were attached to all major body joints, namely both shoulders, elbows, wrists, both sides of the hip, both knees and ankles.

We recorded some 19 minutes of data for the above activities. The data was classified using a Naïve Bayes classifier with 5-fold cross-validation. Figure 5 shows some recognition results for different sub-sets of sensors. The right-most sets of bars show that the recognition rate decreases as fewer sensors are used. The recognition rates for the ‘leg-only’ activities are very similar for all sets of sensors, with the exception of the set of upper body sensors, which is to be expected.

More detailed experiments [3] show that reducing the number of sensors either works well for simpler contexts (such as walking, sitting, standing) or reduces the recognition score.
Depending on the complexity of the activity, the drop in recognition score can be slight, e.g. for activities of medium complexity such as upstairs or downstairs, or significant for complex or subtle activities, such as shaking hands or typing on a keyboard.

Figure 5: Recognize User Activity from Body-Worn Acceleration Sensors. Recognition Rates for Different Sub-Sets of Sensors. [Two bar charts with recognition rates from 0 to 100. "Leg-Only Activities": Sitting, Standing, Walking, Upstairs, Downstairs, Average Leg-Only. "Other Activities": Shake Hands, Write on Board, Keyboard Typing, Average all Activities. Sensor sets: All Sensors, Right, Left, Upper Body, Lower Body.]

DISCUSSION AND OUTLOOK
Recording part of or even an entire lifetime will become feasible in the near future. Retrieval within and structuring of such large data collections is a critical challenge for this vision to come true. We propose to use wearable sensor and computer systems to annotate the recorded data automatically with personal information, and to allow for associative retrieval.

We present three applications in this context. Firstly, a Meeting Recorder that automatically annotates recordings using context from body-worn acceleration and audio sensors. Secondly, we have used low-level context information from acceleration, audio, and location to estimate the user’s social and personal interruptability, a high-level context information that can be used both for retrieval and to drive a context-aware application. Thirdly, we have investigated how much information can be gathered from acceleration sensors alone, specifically how many sensors are required and where they could be placed for the recognition of a certain context.

With the technology presented, we can capture personal information of the user.
This personal information can be extended in two ways, either using the user’s interactions with other users or using his digital footprint in the environment. The user’s interaction with others could be detected using the physical presence of the other user’s personal device, and could for example be used to find a specific discussion with the other user. The user’s digital footprint includes not only e-mail, but also his interaction with other electronic devices such as printers, projectors, etc.

Projects such as Microsoft MyLifeBits [5] currently concentrate on collecting all digitally accessible information from the user’s environment such as telephone calls, letters, e-mails, etc., and making it accessible by explicit user annotation. This is complemented by other initiatives, such as DARPA’s LifeLog, which focus instead on sensor-augmented wearable technologies and try to automatically find structure (events and episodes) in the data to facilitate subsequent retrieval. While the work presented fits very well in the latter direction, it is but a first step towards recording a lifetime.

REFERENCES
1. N. Kern, B. Schiele, H. Junker, P. Lukowicz, and G. Tröster. Wearable sensing to annotate meetings recordings. In Proc. ISWC, pages 186–193, 2002.
2. N. Kern and B. Schiele. Context–aware notification for wearable computing. In Proc. ISWC, pages 223–230, White Plains, NY, USA, October 2003.
3. N. Kern, B. Schiele, and A. Schmidt. Multi–sensor activity context detection for wearable computing. In Proc. EUSAI, LNCS, volume 2875, pages 220–232, Eindhoven, The Netherlands, November 2003.
4. Memories for life, CRC Grand Challenges Initiative. http://www.csd.abdn.ac.uk/ ereiter/memories.html.
5. Microsoft MyLifeBits project. http://research.microsoft.com/barc/mediapresence/MyLifeBits.aspx.
6. DARPA LifeLog initiative. http://www.darpa.mil/ipto/programs/lifelog/.
7. M. Lamming and M. Flynn. Forget-me-not: intimate computing in support of human memory. In FRIENDS21, pages 125–128, 1994.
8. A.K. Dey, D. Salber, G.D. Abowd, and M. Futakawa. The conference assistant: Combining context-awareness with wearable computing. In ISWC, pages 21–28, 1999.
9. B. Rhodes. The wearable remembrance agent: A system for augmented memory. In ISWC, pages 123–128, 1997.
10. ICSI Berkeley, The Meeting Recorder Project at ICSI. http://www.icsi.berkeley.edu/Speech/mr/.
11. NIST Automatic Meeting Transcription Project. http://www.itl.nist.gov/iad/894.01/.
12. A. Janin and N. Morgan. Speechcorder, the portable meeting recorder. In Workshop on Hands-Free Speech Communication, 2001.
13. A. Waibel, M. Bett, and M. Finke. Meeting browser: Tracking and summarizing meetings. In Proceedings of the DARPA Broadcast News Workshop, 1998.
14. T. Kontzer. Recording your life. http://www.informationweek.com, Dec. 18, 2001.
15. R. Ueoka, M. Hirose, K. Hirota, A. Hiyama, and A. Yamamura. Study of experience recording and recalling for wearable computer. Correspondences on Human Interface, 3(1):13–16, 2001.02.
16. J. Healey and R. Picard. Startlecam: A cybernetic wearable camera. In ISWC, pages 42–49, 1998.
17. S. Mann. Smart clothing: The wearable computer and wearcam. Personal Technologies, 1(1), 1997.
18. D. Kimber and L. Wilcox. Acoustic segmentation for audio browsers. In Proc. Interface Conference, 1996.

Capture and Efficient Retrieval of Life Log

Kiyoharu Aizawa, Tetsuro Hori, Shinya Kawasaki, Takayuki Ishikawa
Department of Frontier Informatics, The University of Tokyo
+81-3-5841-6651
{aizawa,t_hori,kawasaki,ishikawa}@hal.t.u-tokyo.ac.jp

ABSTRACT
In ``wearable computing'' environments, digitization of personal experiences will be made possible by continuous recording using a wearable video camera. This could lead to an ``automatic life-log application''. It is evident that the resulting amount of video content will be enormous.
Accordingly, to retrieve and browse desired scenes, a vast quantity of video data must be organized using structural information. In this paper, we describe our development of a ``context-based video retrieval system for life-log applications''. This system captures not only video and audio but also various sensor data, and provides functions that make efficient video browsing and retrieval possible by using data from these sensors, some databases and various document data.

Keywords
life log, retrieval, context, wearable

INTRODUCTION
The custom of writing a diary is common all over the world. This fact shows that many people like to log their everyday lives. However, to write a complete diary, a person must recollect and note what was experienced without missing anything. For an ordinary person, this is impossible. It would be nice to have a secretary who observed your everyday life and wrote your diary for you. In the future, a wearable computer may become such a secretary-agent. In this paper, we aim at the development of a ``life-log agent'' that operates on a wearable computer. The life-log agent logs our everyday life on storage devices instead of paper, using multimedia such as a small camera instead of a pencil.

There has been work on logging a person's life in the areas of mobile computing, wearable computing, video retrieval and databases [1,2,3,8,9,10,11]. A person's experiences or activities have been captured from many different points of view. In one of the earliest works [7], various personal activities were recorded, such as personal location and encounters with others, file exchange, workstation activities, etc. Diary recording using additional sensors has been attempted in the wearable computing area. For example, in [2], a person's skin conductivity was captured as a video retrieval key. In [11], not only wearable sensors but also RFIDs for object identification were utilized. Meetings were also recorded using sensors for speaker identification [9].
In the database area, the MyLifeBits project attempts to exhaustively record a person's activities such as document processing, web browsing, etc.

We focus on continuously capturing our experiences with wearable sensors including a camera. In our previous works [4,5], we used a person's brain waves and motion to retrieve videos. In this paper, we describe our latest work, which can retrieve videos using more contexts.

PROBLEMS IN BROWSING LIFE-LOG VIDEO
A life-log video can be captured using a small wearable camera with a field of view equivalent to the user's field of view. Videos are the most important content of life-log data. By continuously capturing life-log videos, personal experiences of everyday life can be recorded on video, which is a most popular medium. Instead of writing a diary, a person can simply order the life-log agent to start capturing a life-log video at the beginning of every day.

With a conventional written diary, a person can look back on a year at its end by reading the diary, and will quickly finish reading it and easily review the events of the year. Watching life-log videos, however, is a critical problem. It would take another year to watch the entire life-log video for one year. Although it is surely necessary to digest or edit life-log videos, editing takes even more time. It is therefore most important to be able to process a vast quantity of video data automatically.

Conventional Video Retrieval Systems
A variety of video retrieval systems now exist. Conventional systems take a content-based approach. They digest or edit videos by processing various features extracted from the image or audio signals. For example, they may utilize color histograms extracted from the image signals. However, even with such information, computers do not understand the contents of the videos, and they can seldom help their users to easily retrieve and browse the desired scenes in life-log videos.
In addition, such image signal processing has a very high computational cost.

Our Proposed Solution to this Problem
Life-log videos are captured by a user. Therefore, as the life-log video is captured, various data other than video and audio, such as GPS, motion, etc., can be recorded simultaneously. With this information, computers may be able to use contexts as well as contents; our approach is thus very different from conventional video retrieval technologies.

CAPTURING SYSTEM
The life-log agent is a system that can capture data from a wearable camera, a microphone and various sensors that show contexts. The sensors we used are a brain-wave analyzer, a GPS receiver, an acceleration sensor and a gyro sensor. All these sensors are attached to the notebook PC through serial ports, USB ports and PCMCIA slots (Figures 1 and 2).

Figure 2. Capturing System

Next, using a modem, the agent can connect to the Internet almost anywhere via the PHS network of NTT-DoCoMo (PHS, the Personal Handyphone System, is a versatile cordless/mobile system developed in Japan). By referring to data on the Internet, the agent records ``the present weather in the user's location'', ``various news on that day, which were offered by some news sites or some email magazines'', ``all web pages (*.html) that the user browses'' and ``all emails that the user transmitted and received''.

Finally, the agent monitors and controls the following applications: ``Microsoft Word'', ``Microsoft Excel'', ``Microsoft PowerPoint'' and ``Adobe Acrobat''. Together with web browsing and the sending and receiving of emails, these applications are the main software people use on a computer. Because the agent monitors and controls them, when the user opens a document file (*.doc; *.xls; *.ppt; *.pdf) in one of these applications, the agent can order the application to copy the file and save it as text data.

The user can use his cellular phone as a controller for the operations ``start/stop life-log''.
The agent recognizes the user's operations on his cellular phone via PHS.

Figure 1. Diagram of Capturing System

RETRIEVAL OF LIFE-LOG VIDEO
We human beings save many experiences as a vast quantity of memories over many years of life while arranging and selecting them, and we can quickly retrieve and utilize necessary information from our memory. Some psychological research says that we manage our memories based on the contexts at the time. When we want to remember something, we can often use such contexts as keys, and recall the memories by associating them with these keys.

For example, to recollect the scenes of a conversation, the typical keys used in the memory recollection process are context information such as ``what, where, with whom, when, how''. A user may pose the following query (Query A): “On a cloudy day in mid-May when the Lower House general election was held, after making my presentation about lifelog, I was called to Shinjuku by the email from Kenji, and I talked with him while walking at a department store in Shinjuku. The conversation was very interesting! I want to see the scene to remember the contents of the conversation”.

In conventional video retrieval, the low-level features of the image and audio signals of the videos are used as keys for retrieval. These are probably not suitable for queries that resemble the way we query our own memories, as in Query A. However, data from the brain-wave analyzer, the GPS receiver, the acceleration sensor, and the gyro sensor correlate highly with the user's contexts. The life-log agent estimates its user's contexts from these sensor data and some databases, and uses them as keys for video retrieval. Thus, the agent retrieves life-log videos by imitating the way a person recollects experiences from his memories. It is conceivable that by using such context information, the agent can produce more accurate retrieval results than by using only audiovisual data.
Moreover, each input from these sensors is a one-dimensional signal, and the computational cost for processing them is low.

Keys Obtained from Motion Data
The life-log agent feeds the data from the acceleration sensor and the gyro sensor into the K-Means method and an HMM, and estimates the user's motion state. The details are in our previous paper [5]. In Query A, the conversation was held while the user was walking.

Keys Obtained from Face Detection
The life-log agent detects a person's face in life-log videos by processing the color histogram of the video image. Our method uses only very simple processing of the color histogram. Accordingly, even if there is no person in the image, the agent makes a wrong detection when skin color predominates. However, the agent shows its user the frame images and the time of the scene in which the face was detected. If it is a wrong detection, the user can ignore it or delete it. If the image is detected correctly, the user can look at it and judge who it is. Therefore, identification of a face is unnecessary and simple detection is enough here. In Query A, the conversation was held when the user was with Kenji.

Figure 5. A result of face detection

Keys Obtained from Brain-Wave Data
A sub-band (8-12 Hz) of brain waves is called the α wave, and it clearly shows the person's arousal status. When the α wave is low (α-blocking), the person is in a state of arousal, or in other words, is interested in or paying attention to something. We demonstrated in [4] that we can effectively retrieve a scene of interest to a person using his brain waves. In Query A, the conversation was very interesting.

Figure 3. Interface for playing the video

Figure 4. Interface for managing videos

Keys Obtained from GPS Data
From the GPS signal, the life-log agent acquires information about the position of its user as longitude and latitude when capturing a life-log video. The contents of videos and the location information are automatically associated.
Longitude and latitude are numerical values that identify positions on the Earth's surface relative to a datum position. Therefore, they are not intuitively readable for users. However, the agent can convert longitude and latitude into hierarchically structured addresses using a special database, for example, ``7-3-1, Hongo, Bunkyo-ku, Tokyo, Japan''. The results are familiar to us, and we use them as keys for video retrieval.

Latitude and longitude also become intuitively understandable when they are plotted on a map as the footprints of the user, and thus become keys for video retrieval. ``What did I do when capturing the life-log video?'' A user may be able to recollect it by seeing his footprints. The agent draws the user's footprints for the video under playback using a thick light-blue line, and draws other footprints using thin blue lines on the map. By simply dragging the mouse on the map, the user can change the area displayed. The user can also make the map display another area by clicking any address among the places where footprints were recorded. The user can watch the desired scenes by choosing arbitrary points on the footprints.

Figure 6. Interface for retrieval using a map

Moreover, the agent has a town directory database. The database holds information about one million or more public institutions, stores, companies, restaurants, and so on, in Japan. Except for individual dwellings, the database covers almost all places in Japan, including small shops and small companies managed by individuals. In the database, each site has its name, its address, its telephone number, and its category, organized in layered structures.

Using this database, a user can retrieve his life-log videos as follows. He can enter the name of a store or an institution, or can input the category. He can also enter both. For example, we assume that the user wants to review the scene in which he visited the supermarket called ``Shop A'', and enters the category-keyword ``supermarket''. To filter retrieval results, the user can also enter the rough location of Shop A, for example, ``Shinjuku-ku, Tokyo''.

Figure 7. Retrieval using the town directory

Because the locations of all the supermarkets visited must be indicated in the town directory database, the agent accesses the town directory and finds one or more supermarkets near his footprints, including Shop A. The agent then shows the user the formal names of all the supermarkets that he visited and the times of the visits as retrieval results. He will probably choose Shop A from the results. Finally, the agent knows the time of the visit to Shop A and displays the desired scene. In Query A, the conversation was held at a shopping center in Shinjuku.

The agent may make mistakes in answering a query such as the one shown above. Even if the user has not actually been into Shop A but has only passed in front of it, the agent will list that event as one of the retrieval results.

To cope with this problem, the agent investigates whether the GPS signal was received during the event. If the GPS became unreceivable, it is likely that the user went into Shop A. The agent measures the length of the period during which the GPS was unreceivable, and equates that to the time spent in Shop A. If the GPS did not become unreceivable, the user most likely did not go into Shop A.

Figure 8. Retrieval experiments

We examined the validity of this retrieval technique. First, we went to Ueno Zoological Gardens, the supermarket ``Summit'', and the drug store ``Matsumoto-Kiyoshi''. We found that this technique was very effective. For example, when we referred to the name-keyword ``Summit'', we found as the result the scene that was captured when the user was just about to enter ``Summit''.
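The GPS-dropout heuristic described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the agent's actual code: GPS fixes are assumed to arrive as timestamps in seconds, a silence longer than a hypothetical `min_gap` of 60 seconds is treated as "unreceivable", and the function names are invented for this example.

```python
def find_gps_gaps(fix_times, min_gap=60.0):
    """Return (start, end) pairs where consecutive GPS fixes are more than min_gap apart."""
    times = sorted(fix_times)
    return [(a, b) for a, b in zip(times, times[1:]) if b - a > min_gap]


def estimate_visit(fix_times, near_start, near_end, min_gap=60.0):
    """Estimate the time spent inside a site the footprints came close to.

    near_start, near_end: the period during which the footprints were near the site.
    Returns the length of the first GPS gap overlapping that period, or None
    when the signal was never lost (the user probably only passed by).
    """
    for start, end in find_gps_gaps(fix_times, min_gap):
        if start <= near_end and end >= near_start:
            return end - start
    return None
```

In this sketch a result of `None` would let the agent drop the event from the retrieval results, while a numeric result doubles as the estimated length of the stay.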
When we referred to the category-keyword ``drug store'', we found the scene that was captured when the user was just about to enter ``Matsumoto-Kiyoshi'', and similarly for Ueno Zoological Gardens. These retrievals completed quickly; retrieval over three hours of video took less than one second.

Keys Obtained from Time Data
The agent records the time by asking the operating system for the present time, and associates the contents of life-log videos with the time when they were captured. In Query A, the conversation was held in mid-May.

Keys Obtained from the Internet
The life-log agent records the weather and news of the day, the web pages that the user browses, and the emails that the user sends and receives. These data are automatically associated with time data. Afterwards, these data can be used as keys for life-log video retrieval. In Query A, the conversation was held after the user received the email from Kenji on a cloudy day when the Lower House general election was held.

Figure 9. A result of retrieval from a Web document

Keys Obtained from Various Applications
All the document files (*.doc; *.xls; *.ppt; *.pdf) that the user opens are copied and saved as text. These copied document files and text data are automatically associated with time data. Afterwards, these text data can be used as keys for life-log video retrieval. In Query A, the conversation was held after the user made a presentation about life-log (assuming that PowerPoint was used for the presentation).

Figure 10. A result of retrieval from a PowerPoint document

Conversely, the agent can also perform video-based retrieval of such documents, including web pages and emails.

Keys Added by the User
The user can order the life-log agent to add retrieval keys (annotations) with an arbitrary name by simple operations on his cellular phone while the agent is capturing a life-log video. This enables the agent to identify a scene that the user wants to remember throughout his life, and thus the user can easily access the videos that were captured during precious experiences.

Retrieval with a Combination of Keys
Consider Query A again. The user may have met Kenji many times during some period of time. The user may have gone to a shopping center many times during the period. The user may have made a presentation about life-log many times during the period, and so on. Accordingly, if a user uses only one kind of key among the various kinds of keys when retrieving life-log videos, too many undesired results will appear. By using as many different keys as possible, only the desired result may be obtained, or at least most of the undesired results can be eliminated.

CONCLUSION
By using the data acquired from various sources while capturing videos and combining these data with data from some databases, the agent can estimate its user's various contexts with an accuracy and speed that do not seem achievable with conventional methods. These are the reasons the agent can respond to video retrieval queries of various forms correctly and flexibly.

REFERENCES
1. S. Mann, `WearCam' (The Wearable Camera), In Proc. of ISWC 1998, 124-131.
2. J. Healey, R.W. Picard, A Cybernetic Wearable Camera, In Proc. of ISWC 1998, 42-49.
3. J. Gemmell, G. Bell, R. Lueder, S. Drucker, C. Wong, MyLifeBits: fulfilling the Memex vision, In Proc. of ACM Multimedia 2002, 235-238.
4. K. Aizawa, K. Ishijima, M. Shiina, Summarizing Wearable Video, In Proc. of IEEE ICIP 2001, 398-401.
5. Y. Sawahata, K. Aizawa, Wearable Imaging System for Summarizing Personal Experiences, In Proc. of IEEE ICME 2003.
6. T. Hori, K. Aizawa, Context-based Video Retrieval System for the Life-log Applications, In Proc. of MIR 2003, ACM, 31-38.
7. M. Lamming and M. Flynn, Forget-me-not: intimate computing in human memory, In Proc. FRIEND21, Int. Symp. Next Generation Human Interface, Feb. 1994.
8. B.J. Rhodes, The wearable remembrance agent: a system for augmented memory, In Proc. of ISWC 1997.
9. N. Kern et al., Wearable sensing to annotate meeting recordings, In Proc. of ISWC 2002.
10. A. Dey et al., The conference assistant: combining context-awareness with wearable computing, In Proc. of ISWC 1999.
11. T. Kawamura, Y. Kono, M. Kidode, Wearable interface for a video diary: towards memory retrieval, exchange and transportation, In Proc. of ISWC 2002.

Figure 11. Interface of the life-log agent for browsing and retrieving life-log videos

Exploring Graspable Cues for Everyday Recollecting
Elise van den Hoven
Industrial Design Department, Eindhoven University of Technology
P.O.Box 513, Den Dolech 2, 5600 MB Eindhoven, The Netherlands
+31 40 247 8360
e.v.d.hoven@tue.nl

ABSTRACT
This paper gives a short overview of a four-year PhD project which concerned several aspects of a device that helps people to recollect personal memories in the context of the home. Several studies were done on related topics, such as autobiographical memory cuing, using souvenirs in the home, and developing the user-system interaction of a portable digital photo browser.

Keywords
Everyday Recollecting, Ambient Intelligence, Recollection-Supporting Device, Digital Photo Browser, Graspable User Interfaces, Tangible Souvenirs.

INTRODUCTION
Most people are actively dealing with their personal memories. Take for example a woman who has just returned from a holiday. This person probably talks about her experiences with various people, which in fact is the rehearsal and perhaps the fixation of her holiday memories. When she refers to other holidays in the same conversation she is trying to relate her new memories to existing memories, and therefore she is working on her old memories at the same time. And there is a fair chance that her listeners are doing the same thing. Since most people reminisce every day, and the results of this process shape their personal histories and thus their identities, this is an important process, which often goes unnoticed.
Today, with the increasing digitalization of memory carriers, such as digital photos, this remembering or reminiscing can be aided in ways previously impossible. This paper investigates the possibilities of supporting people in dealing with their memories with increasing digital support.

Context
The work described in this paper was done as a four-year PhD study [1] both at Philips Research Laboratories Eindhoven and at the Eindhoven University of Technology. Currently the author is continuing this work at the Eindhoven University of Technology as an assistant professor in the Industrial Design department. The work was concerned with the topic of supporting in-home recollecting. Its content was influenced by both the project context and the industrial context. The project team decided together on the aim of the work, which was to build a demonstrator of a “Recollection-Supporting Device”. The industrial context was that the project was part of the Ambient Intelligence research program at Philips Research.

Paper Outline
The following section of this paper gives an overview of the above-mentioned PhD thesis, followed by some sections on relevant topics worked out in more detail.

THESIS OVERVIEW
Several studies were performed in order to explore the wide area of recollecting memories in the home context. The first study used questionnaires to test how people use souvenirs in the home. It confirmed that souvenirs can be seen as external memory and that they are suitable candidates to be used as tangibles in a graspable user interface for the Recollection-Supporting Device. The second study focused on the analysis, design, implementation and evaluation of a user interface for browsing and viewing digital photos on a touch screen device.
This user interface consisted of a graphical and a graspable part, the latter using personal souvenirs as tangible user interface controls. The research into the use of tangibles led to an extension of the current Graspable UI-categorization, which mentioned only so-called “generic” objects. From the souvenirs it was learnt that compared to generic objects personal objects have the benefit that users already have a mental model and the object is embedded in the user’s personal environment. The Digital Photo Browser raised some issues on memory cuing. Therefore, an experiment was conducted which compared the effect of modality (odor, physical object, photo, sound and video) on the number of memories people had from a unique one-day event. During this event all above-mentioned modalities were present and they were later used to cue the participants. Against expectation, the no-cue condition (in effect only a text cue) created on average significantly more memories than any of the cued conditions. The given explanation for this effect is that “specific cues” can make people focus on the perceived information, whereas text leaves space for reflection. In view of the inherent qualities to souvenirs, representing a memento for storing and stimulating memories, the physical-object cue condition was expected to do better than it did in practice. Before concluding that this expectation was not confirmed it was tested whether the participants in the cuing study indeed viewed their personally handmade artefacts as souvenirs. It turned out that most of them did and therefore it had to be concluded that souvenirs cued fewer memory details than text-only cues. All the information from the above-mentioned studies served as input for the last part of the thesis, which summarizes guidelines for designers who want to realize a future Recollection-Supporting Device. This part comprises a literature overview, a lessons-learned section and some future directions. 
Although this thesis answers many questions about several aspects of a Recollection-Supporting Device, a lot of work remains to be done in order to realize one, because this multidisciplinary area appeared to be rather unexplored.

AUTOBIOGRAPHICAL MEMORY THEORY
Recollecting personal experiences concerns Autobiographical Memory (AM), which is defined as “memory for the events of one’s life” [2]. AM, which is a part of Long-Term Memory, includes all the memories people have that have something to do with themselves, including traumatic experiences.

According to Cohen [3] six functions of Autobiographical Memory can be distinguished:
1. The construction and maintenance of the self-concept and self-history, which shapes personal identity;
2. Regulating moods;
3. Making friends and maintaining relationships by sharing experiences;
4. Problem-solving based on previous experiences;
5. Shaping likes, dislikes, enthusiasms, beliefs and prejudices, based on remembered experiences;
6. Helping to predict the future based on the memories of the past.

Personal memories are important to people, as can be seen from this wide range of functions, from solely internal use to communication between people.

Cuing memories is one way of retrieving autobiographical memories. A cue (or trigger) is a stimulus that can help someone to retrieve information from Long-Term Memory, but only if this cue is related to the to-be-retrieved memory. The stimuli most often used in studies are photos, smells or text labels. But anything could be a cue (a spoken word, a color, an action or a person), as long as there is a link between the cue and the to-be-remembered event. A combination of cues increases the chance of retrieving a memory, especially when a subject in a cued-recall experiment had to perform activities, such as writing with a pen or closing a door (e.g. [4]). One example of a memory cue is a souvenir.

SOUVENIRS
The word souvenir originates from the Middle French (se) souvenir (de), meaning “to remember”, which in turn comes from the Latin word subvenire, meaning “to come up, come to mind”. From a questionnaire study with 30 participants it was concluded that many people have a collection of souvenirs at home. This collection contained on average over 50 souvenirs of the following three categories: holiday souvenirs, heirlooms and gifts. All three categories made the participants recollect memories when they looked at their most valuable souvenirs, meaning that the souvenirs serve as external memory for those people. Three quarters of the participants brought souvenirs from their holidays but most of them did not throw away any during the last year. Eighty percent of the participants thought self-made objects could be souvenirs. When participants were asked to name their most valuable souvenir, only half of these objects were from a holiday.

Neisser (1982) describes a study on external memory aids used by students. They were asked what aids they used to remember future or past events, and one of the results was that students do not know which types of external memory they use unless these are explicitly mentioned, as in “do you use diaries for remembering”. This result is consistent with the results of the investigation presented here, because the souvenir-questionnaire participants did not mention remembering as a function of their souvenirs. But apparently they did use their souvenirs as external memory, because when they were asked what happened when they looked at their most-cherished souvenirs, half of the participants mentioned that memories popped up or were relived.

DIGITAL PHOTO BROWSER
After learning that souvenirs can serve as external memory for their owners, it was decided to build a demonstrator with souvenirs. Together with the project team it was decided to focus on digital photos and to implement a Digital Photo Browser (see Figure 1, [5]). This device and the user interface were designed and implemented based on requirements which were derived from a scenario and a focus group. Based on these requirements a user-interface concept was designed that reminds people of their photos by continuously scrolling them along the display.

Fig. 1. The Digital Photo Browser and some souvenirs in an intelligent living room.

When brought into an intelligent room, the implemented Digital Photo Browser is able to recognize the presence of people, graspable objects, and available output devices. Since souvenirs are suitable for use in a Graspable User Interface and they have the ability to cue recollections, souvenirs are used as shortcuts to sets of digital photos. (A similar interaction was presented in scenarios of the POEMs project, which stands for Physical Objects with Embedded Memories [6].)

The user interface of the Digital Photo Browser (see Figure 2) consists of three areas: 1 - an area on the left which shows a moving photo roll, 2 - a central area which allows enlarging individual photos, both in landscape and portrait format, 3 - an area on the right where icons of the current user (3a), of other display devices (3b) or of detected graspable objects (3c) can be shown. The roll (1), which shows on average eight thumbnails on-screen, consists of two layers: the first layer shows an overview of all the albums owned by the current user and the second layer shows the contents of each album. This second layer is accessible by clicking on an album icon; one can return to the first layer by clicking the “back” button.

In short, the Digital Photo Browser is a portable and wireless touch-screen device. The user can interact via touch (drag and drop) and via the physical objects. These objects are RFID-tagged and are recognized when placed on a special table. The corresponding photos are then immediately shown on the portable device. Via a simple drag and drop a photo can be enlarged on this device or on any other screen that is available for viewing photos.

CUING AUTOBIOGRAPHICAL MEMORIES
From the AM-theory and from observing people using the Digital Photo Browser it was learned that the ideal “recollection-supporting” device cannot and should not contain the memories, but the cues to the memories. That is why the suitability of several types of recollection cues, including photos, audio, video, odor, and graspable objects, was investigated [7,8]. In order to test this, 70 participants joined in a standardized real-life event, and one month later they were cued in a laboratory living room setting while filling out questionnaires, either without a cue or with a photo, object, odor, audio or video cue. In addition, a special method was developed in order to analyze the number of recollection details in these written free recall accounts.

This study is presumably the first to compare recollections of a real-life event quantitatively across different media types, and it is perhaps also the first to find a negative effect of cues on the number of memories produced compared to a no-cue situation: the main result shows that the no-cue condition for the recall of a real-life event generated significantly more memory details than any of the cue conditions (object, picture, odor, sound and video). This is against expectation, since the encoding specificity principle, which states that environmental cues that match information encoded in a stored event or memory trace cue recollection of the complete memory (see [9] for an overview on context-dependent memory), and several other studies (see [1] for an overview) do predict and show a positive cuing effect on memory recall.
In order to explain this result, it is hypothesized that cues might have a filtering effect on the internal memory search, resulting in fewer memories recalled with a cue than in the no-cue condition.

Fig. 2. A sketch of the Digital Photo-Browser user interface (for an explanation see text).

LESSONS LEARNED
The recommendations of Stevens et al. [10] were derived from their own study, but some of them were independently uncovered in the work presented in this paper. They are mentioned here because they are important for the design of a recollection-supporting device:
• Develop the process of annotating or organizing memories into an activity of personal expression.
• Make the inclusion of practically any object possible.
• Bring the interaction away from the PC.
• Develop “natural” interactions (i.e. touch and voice).
• Encourage storytelling at any point.
• Assure the capability of multiple “voices”.
• Create unique experiences, especially for creating and viewing annotations.

The design recommendations given by Stevens et al. [10] were the starting point for the lessons learned mentioned here, which are based on all the chapters in the thesis [1]:
• Include souvenirs in a Recollection-Supporting Device (RSD).
• Souvenirs should be used as tangibles in a Graspable User Interface of an RSD.
• Support the personal identity of the user and the communication to other people.
• More media types than just text should be used in the RSD.
• The RSD should not pretend to know the truth, since this might interfere with the needs of the user.
• Create a metadata system that can be changed easily by the user.

ACKNOWLEDGMENTS
The author would like to thank her supervisors, prof. Eggen, prof. Kohlrausch and prof. Rauterberg, as well as the other members of the project team that created the Digital Photo Browser: E. Dijk, N. de Jong, E. van Loenen, D. Tedd, D. Teixeira and Y. Qian.

REFERENCES
1. Hoven, E. van den (2004). Graspable Cues for Everyday Recollecting, Ph.D.
thesis, Eindhoven University of Technology, The Netherlands, May 2004, ISBN 90-386-1958-8.
2. Conway, M. A. and Pleydell-Pearce, C. W. (2000). The Construction of Autobiographical Memories in the Self-Memory System, Psychological Review, 107 (2), 261-288.
3. Cohen, G. (1996). Memory in the real world, Hove, UK: Psychology Press.
4. Engelkamp, J. (1998). Memory for actions, Hove, UK: Psychology Press.
5. Hoven, E. van den and Eggen, B. (2003). Digital Photo Browsing with Souvenirs, Proceedings of the Interact2003 (videopaper), 1000-1004.
6. Ullmer, B. (1997). Models and Mechanisms for Tangible User Interfaces, Masters thesis, MIT Media Laboratory, Cambridge, USA.
7. Hoven, E. van den and Eggen, B. (2003). The Design of a Recollection Supporting Device: A Study into Triggering Personal Recollections, Proceedings of the Human-Computer Interaction International (HCI-Int. 2003), part II, 1034-1038.
8. Hoven, E. van den, Eggen, B., and Wessel, I. (2003). Context-dependency in the real world: How different retrieval cues affect Event-Specific Knowledge in recollections of a real-life event, 5th Biennial Meeting of the Society for Applied Research in Memory and Cognition (SARMAC V), Aberdeen, Scotland, July 2003.
9. Smith, S. M., and Vela, E. (2001). Environmental context-dependent memory: A review and meta-analysis, Psychonomic Bulletin and Review, 8 (2), 203-220.
10. Stevens, M. M., Abowd, G. D., Truong, K. N., and Vollmer, F. (2003). Getting into the Living Memory Box: Family Archives & Holistic Design, Personal and Ubiquitous Computing, 7 (3-4), 210-216.

Remembrance Home: Storage for re-discovering one’s life

Yasuyuki KONO
Graduate School of Information Science, NAIST
Keihanna Science City, 630-0192, JAPAN
kono@is.naist.jp
http://ai-www.aist-nara.ac.jp/~kono/

Kaoru MISAKI
office ZeRO
2-25-27 Motoizumi, Komae 201-0013, JAPAN
misaki_kaoru@nifty.ne.jp
http://homepage3.nifty.com/misaki_kaoru/

ABSTRACT
Remembrance Home is a project for supporting one's remembrance throughout his/her life by employing his/her house as storage media for memorizing, organizing and remembering his/her everyday activity. The Remembrance Home stores his/her everyday memories, which consist of digital data of both what he/she has ever seen and what he/she has ever generated. He/she can augment his/her memory by passively viewing slide-shown images played on displays arranged ubiquitously in the house. The experiments have shown that the prototype system, which contains over 570,000 images, 35,000 titles of hypertext data, and 250,000 hyperlinks among them, augments his/her remembering activity.

Author Keywords
Augmented Memory, Remembrance Home, LifeLog, Passive Browsing

INTRODUCTION
The Remembrance Home stores one’s (and his/her family’s) digitized lifetime memories. Information technologies would provide virtually unlimited storage for storing one’s life, i.e., both what he/she has ever experienced/seen (documents, photos, movies, graphics, books, notes, pictures, etc.) and generated (text articles, drawings, etc.). The digital record must enrich his/her life because he/she augments his/her memory by re-discovering his/her past experience. In all ages a house can be a medium for storing/recording its residents’ memories, e.g., portraits on furniture or children’s doodles on a wall. Viewing a record in the real world triggers one’s remembering of the experiences associated with it. The Remembrance Home is a prototype house of the next era for augmenting human memory by naturally integrating digital devices into the house.

The project started in the year 2000. We have employed Kaoru Misaki’s house as the prototype. It is equipped with several LCDs and some video projectors embedded into walls and furniture. These display devices continuously and automatically slide-show digitized and stored still images of his life-slice, e.g., photos, books, notebooks, and letters. The number of the images exceeds 570,000 and is increasing by 20,000 per month on average. The hyperlink structure, which currently consists of the images and 35,000 titles of texts he has ever written, is getting larger through his rediscovering of his past, triggered by the slide-shown images. We have empirically found that browsing one’s past by passively viewing digitized images activates his/her remembrance activity.

THE REMEMBRANCE HOME PROJECT

Overview of Lifetime Memories
Kaoru Misaki’s house¹ was rebuilt to set up his lifetime memory storage and to install a memory browsing and rediscovering environment. His lifetime memory consists of everything that either he has ever seen or he has ever generated, and that can be digitized. What he has ever seen mainly consists of 1) photo images he has ever taken, and 2) paper materials that he had stored either in his house or in his parents’ house, such as books, magazines, leaflets, textbooks, and letters. The paper materials were taken apart into sheets, and each page was digitized as a JPEG image file by digital scanners. What he has ever generated mainly consists of 1) digitally written documents such as articles, diaries, and e-mails, and 2) paper materials he wrote/drew, such as diaries, letters, articles, and notebooks, which were also digitized into JPEG images. The digitizing of paper materials has been outsourced and is in progress. The number of digitized images increases by about 20,000 files a month. Because the pace of the increase is so rapid and the digitizing is not performed by himself, it is impossible to make symbolic annotations to each image synchronously with its digitizing process, as is done in the MyLifeBits project [1]. The images and texts are manually linked with each other in his daily activity.

¹ He is a technical journalist who usually works in his library to write articles.
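The growth figures quoted above (over 570,000 images, increasing by about 20,000 files a month) can be turned into a rough projection. The sketch below assumes a constant monthly rate, which is a simplification and not a claim made by the authors; the function names are invented for this example.

```python
def projected_images(current, per_month, months):
    """Linear projection of the archive size after `months` months."""
    return current + per_month * months


def months_until(current, per_month, target):
    """Whole months needed to reach `target` images at a constant monthly rate."""
    if target <= current:
        return 0
    remaining = target - current
    return -(-remaining // per_month)  # ceiling division on integers


# Ten years (120 months) at the quoted rate, starting from 570,000 images.
ten_year_total = projected_images(570_000, 20_000, 120)
```

Under these assumptions `ten_year_total` is 2,970,000, that is, roughly three million images, and the archive would pass one million images after 22 more months.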
Whenever he is inspired by viewing digitized images, he manually establishes hyperlinks between the images and associable texts. His lifetime memory consists of over 570,000 images, 35,000 titles of texts, and 250,000 hyperlinks among them. About 100,000 of the images are newly taken digital photos, and the rest are scanned materials that belong to his past. The data are held in a BTRON-based environment where the user can easily establish hyperlinks among data through its GUI.

Figure 1. LCDs embedded into the library desk. The left display slide-shows digitized images.
Figure 2. LCDs settled in the dining-kitchen.
Figure 3. PC screen projected on the wall in the library.
Figure 4. Bookshelf before the project.
Figure 5. Current bookshelf. Papers have gone away.

Furnishings and Their Settings
In one’s living space, objects that are necessary but uncomfortable for daily living, such as paper files, computing devices, cables, and audiovisual equipment, should be transparent/invisible to him/her. In the Remembrance Home, computers, storage, audiovisual equipment, and most keyboards are set under the floor. Most cables are embedded in walls and ceilings. Several LCDs and some video projectors are settled ubiquitously in the house so as to merge naturally into the environment (see Figures 1-3). The amount of documents in bookshelves/cabinets has been greatly reduced, because they were discarded after the digitizing (see Figures 4-5).

WORKING/LIVING IN THE REMEMBRANCE HOME
The Remembrance Home Project was started in the year 2000 by Kaoru Misaki. Paper materials have been continuously digitized month by month and stored in a Windows-based file system. At the beginning of the project, he had daily diary text data, started in June 1986, stored in the BTRON-based file system, which is suitable for making annotations and hyperlinks among data (see Figure 6). Each digitized image (page) was originally annotated by the following two features: 1) the day and the time it was scanned, as the timestamp of the image file, and 2) the title of the set of pages, such as the book title, as the folder name. Additionally, the following digital data were added to the storage on average: (a) 20 e-mail texts a day, (b) 10 web pages a day, and (c) 100 digital photo images a day.

In the early stage of the project, each image was manually browsed by him and was hyperlinked from existing text data with additional text annotation. Sometimes, inspired by the browsed image, he created a new text file to write down the re-discovered event after determining the era, the year, the month, or the day associated with it. We call such hypertexts, which mention his re-discovered past experiences, the “past diary.” Figure 7 shows an example of a digitized image.

Figure 6. Embedded hyperlinks in a past diary text.
Figure 7. Example of digitized image (a notebook page of a class when he was a high school student).

Developing Memory Browsing Environment
Although symbolic annotation creation is crucial for active browsing of non-text media [2], the rapid increase of scanned images prevented him from creating annotations on time. The date of most of the images, e.g., the day he got the original material, the day the original material was distributed, or the day of the event it reported, was ambiguous or unknown. It takes around 36 days merely to view all the 570,000 images for 2 seconds each, if he spends 8 hours per day on the task.

The passive browsing method, in which he views periodically slide-shown images, was employed after active browsing, in which he actively selected a folder and viewed the thumbnails in it. Active browsing became harder as the number of images increased. The difficulty undermined his motivation for re-discovering in April 2002, when the number exceeded 100,000. We have employed “JPEG Saver,” a freeware screensaver, for randomly showing images on the screens settled ubiquitously in the house [3]. Switching the browsing style has greatly activated his re-discovering activity. Inspired by images shown randomly every day, he has re-discovered his past experiences step by step. His past diary has become more detailed and accurate. By remembering the details of each past experience, he has divided his diary files by month, whereas his past diary was divided by years or school times before the passive method was employed (see Figures 8 and 9). Before April 2002, he had 129 diary texts, among which 12 (9%) were past diaries. The total size of these diary texts was approximately 230K bytes. After the switch, he has created 68 diary files, among which 48 (72%) are past diaries. The total size of these diary texts is approximately 855K bytes. Furthermore, 33 files (66%) of the past diaries are divided by month, i.e., each title contains not only the year but also the month, as depicted in Figure 9.

Figure 8. List of past diary titles before Apr. 2002.
Elementary School Times, 1978-1980, 1981-1983, 1984, 1993, 1985, 1986, 1987, 1988, 1989, 1992, 1994

Figure 9. List of past diary titles created after Apr. 2002.
1965-1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1977/06, 1977/07, 1977/08, 1977/09, 1977/10, 1978, 1978/04, 1978/08, 1979, 1979/08, 1979/12, 1980, 1980/02, 1980/04, 1980/05, 1980/06, 1980/07, 1980/08, 1980/09, 1980/10, 1980/11, 1980/12, 1981, 1981/02, 1981/03, 1981/04, 1981/06, 1981/07, 1981/08, 1982, 1982/04, 1982/10, 1983, 1984/06, 1984/11, 1984/12, 1988/05, 1988/06

Explosion of Past-Rediscovering Activity
We have empirically found that browsing his past by passively viewing digitized images in daily life greatly activates his past-rediscovering activity. Since the style of browsing digitized images was switched to the passive one, his description in the past diary has become more detailed, as mentioned above. Figure 10 shows the trend of the total size of past diary texts. This indicates that the amount of description of his past experiences increased sharply after Apr. 2002, i.e., the total size of past diary texts is approximately 4 times that before the switch. Figure 11 shows the trend of the numbers of hyperlinks from/to past diary texts. This indicates that passively viewing slide-shown images contextually associated with past experiences explosively activates his past-rediscovering activity, i.e., referring to past experiences and annotating the past diary.

Figure 10. Total size of past diary texts. [Plot, monthly from 1997/04 to 2003/10: generated past diaries (bytes) per month, and total size (bytes) of past diary.]
Figure 11. Total numbers of hyperlinks from/to past diary texts. [Plot, 1997/04 to 2003/10: total number of hyperlinks from diary texts, and total number of hyperlinks to past diary texts.]

Most of the digital contents in the storage are in either text or JPEG image format. A photo image generally captures the surroundings together with the object(s) of the photo. When an image is viewed repeatedly within a certain period of time, the viewer’s attention moves to the details of the surroundings. Such a transition must further activate his past-rediscovering activity.

CONCLUDING REMARKS
This paper introduced the concept of the Remembrance Home, which supports one's remembrance throughout his/her life by employing his/her house as storage media for memorizing, organizing and remembering his/her everyday activity. This paper also described the design and current implementation of the Remembrance Home. We have been digitizing Kaoru Misaki’s lifetime memories and storing them in the house. The memories must be one of the biggest personal digital memory archives, although large-scale social and digital logging projects are in progress [5, 6]. Passively viewing the memories augments his memory and activates his past-rediscovering activity. The digitizing is still in progress, over 100 times faster than that of MyLifeBits [2]. The Remembrance Home is going to store around 3 million images in 10 years.

We are also planning to extend triggers for one’s remembrance activity from PC screens to the real world, by providing means for hyperlinking among his external memory elements and real-world indexes. As depicted in Figure 5, paper materials have gone away through the project. This means that contexts have been replaced by symbolic annotations and that only indexical objects whose shapes/existences have some meaning for him are left. We should have means for annotating one’s memory with things in the real world. We have already proposed a framework for memory albuming systems, named SARA, that employs real-world objects as media for augmenting human memory, by providing its users with functions for memory retrieval, transportation, editing, and exchange [4]. We believe that integrating the framework into the Remembrance Home brings us a new vision for augmenting one’s memory. The Remembrance Home must also provide its family members with the means for sharing digitally augmented memories.

REFERENCES
1. MyLifeBits Project. http://research.microsoft.com/research/barc/MediaPresence/MyLifeBits.aspx
2. Gemmell, J., Lueder, R., and Bell, G.
Living with a Lifetime Store. Proc. ATR Workshop on Ubiquitous Experience Media, Sept. 2003. http://www.mis.atr.jp/uem2003/WScontents/dr.gemmell.html
3. JPEG Saver. http://hp.vector.co.jp/authors/VA016442/delphi/jpegsaverhp.html (in Japanese)
4. Kono, Y., Kawamura, T., Ueoka, T., Murata, S. and Kidode, M. Real World Objects as Media for Augmenting Human Memory, Proc. Workshop on Multi-User and Ubiquitous User Interfaces (MU3I 2004), 37-42, 2004. http://www.mu3i.org/
5. American Memory. http://memory.loc.gov/
6. Wikipedia. http://en.wikipedia.org/wiki/Main_Page

An Object-centric Storytelling Framework Using Ubiquitous Sensor Technology

Norman Lin, ATR Media Information Science Laboratories, Seika-cho, Soraku-gun, Kyoto 619-02 JAPAN, nlin@atr.jp
Kenji Mase, Nagoya University, Furu-cho, Chigusa-ku, Nagoya City 404-8603 JAPAN, mase@itc.nagoya-u.ac.jp
Yasuyuki Sumi, Kyoto University, Yoshida-Honmachi, Sakyo-ku, Kyoto 606-8501 JAPAN, sumi@acm.org

ABSTRACT

Using ubiquitous and wearable sensors and cameras, it is possible to capture a large amount of video, audio, and interaction data from multiple viewpoints over a period of time. This paper proposes a structure for a storytelling system using such captured data, based on the object-centric idea of visualized object histories. The rationale for using an object-centric approach is discussed, and the possibility of developing an observational algebra is suggested.

Author Keywords

ubiquitous sensors, storytelling, co-experience, experience sharing

ACM Classification Keywords

H.5.1. Information Interfaces and Presentation (e.g., HCI): Multimedia Information Systems

INTRODUCTION

Previous work has developed a ubiquitous sensor room and wearable computer technology capable of capturing audio, video, and gazing information of individuals within the ubiquitous sensor environment [4, 6].
Ubiquitous machine-readable ID tags, based on infrared light emitting diodes (IR LEDs), are mounted throughout the environment on objects of interest, and a wearable headset captures both a first-person video stream and continuous data on the ID tags currently in the field of view (Figure 1). The data on the ID tags currently in the field of view represents, at least at a coarse level, which objects the user was gazing at during the course of an experience.

In this paper, we propose using an object-centric approach to organizing and re-experiencing captured experience data. The result should be a storytelling system structure based on visualized object histories. In the following sections we explore what is meant by an object-centric organizational approach, and present a structure for a storytelling system based on this object-centric idea.

Figure 1: Head-mounted sensors for capturing video and gazing information.

GOALS

The long-term goal of this research is to develop a paradigm and the supporting technology for experience sharing based on data captured by ubiquitous and personal sensors. In a broad sense, the paradigm and technology should assist users in (a) sharing, (b) contextualizing, and (c) re-contextualizing captured experiences. Ubiquitous sensors should automatically capture the content and context of an experience. A storytelling system should then allow users to extract and interact with video-clip based representations of the objects or persons involved in the original experience, in a virtual 3D stage space (Figure 2).

AN OBJECT-CENTRIC APPROACH

The central structuring idea is to focus on objects (physical artifacts in the real world, tagged with IR tags and identifiable via gazing) as the main mechanism or agent of experience generation. Other projects using objects to collect histories include StoryMat [5] and Rosebud [2]; also, [7] discusses the importance of using objects to share experience.
By focusing on objects in this way, ubiquitous and wearable sensors and cameras allow the capturing and playback of personalized object histories from different participants in the experience. An object “accumulates” a history based on persons’ interactions with it, and the ubiquitous sensor and capture system records this history. A storytelling system should allow playback and sharing of these personalized object histories. By communicating personalized object histories to others, personal experience can be shared.

Figure 2: Virtual 3D stage space for storytelling using visualized object histories.

Storytelling: Visualizing and Interacting with Object Histories

Having captured video of personalized object histories, we would like to allow interaction with those objects and their personalized histories in a multi-user environment to facilitate storytelling and experience sharing. Currently, a 3D stage space based on 3D game engine technology is being implemented (Figure 2). Within this 3D stage space, users can navigate an avatar through a virtual “stage set” and interact with video-billboard representations of objects in the captured experiences.

Contextualization and Re-contextualization

Objects are typically not observed or interacted with in isolation; instead, objects are typically dealt with in groups. In terms of the physical sensor technology, this means that when one (tagged) object A of interest is being gazed at, another object B which is also in the current field of view becomes associated with object A. This is one form of context: object A was involved with object B during the course of the experience.

The current working thesis is that an object-centric storytelling system should remember the original context, but should also separate or loosen an object from its original context. The reasoning is that by remembering but loosening the original context, we can both remind the user of the original context (contextualization) and allow the user to reuse the object in another context (re-contextualization).

Concretely, for instance, consider the case where we have two objects, object A and object B, which are recorded on video by a personal head-mounted camera, and which are seen simultaneously in the field of view. Later, if the storyteller plays back a video clip of object A in order to share his experience about object A with someone else, the storyteller should be reminded in some way that object B is also relevant, because it was also observed or gazed at with object A. This is what is meant by system support for contextualization of experience. Essentially, the storytelling system serves as a memory aid to remember the original context of certain objects with respect to a personalized experience.

Re-contextualization, on the other hand, would involve using object A in a new context. Suppose that a second user never saw object A and object B together, but instead saw object A and object C together. From this second user’s personal perspective, A and C are related, but from the first person’s personal perspective, A and B are related. By allowing video clips of A, B, and C to be freely combined in a storytelling environment, and by comparing the current context with pre-recorded and differing personal contexts, we allow the storytellers and audience to illustrate and discover new perspectives, or new contexts, on objects of interest. New stories and new contexts about the objects can be created by combining their captured video histories in new ways.

Towards an Algebra of Observations

The object-centric idea presented above is that an object accumulates history, and that this object history is an agent for generating experience and an agent for transmitting experience to others.
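As a minimal sketch of how an object might accumulate a personalized history, and how the A/B/C example of contextualization and re-contextualization could be queried, consider the following. The class and method names (`Observation`, `ObjectHistories`, `context`, `new_contexts`) are hypothetical and introduced only for illustration; the paper does not specify a data model, and co-observation is reduced here to "tags seen in the same field of view".

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Set


@dataclass(frozen=True)
class Observation:
    """One gaze sample: which tagged objects a user had in view at a time."""
    user: str
    time: float               # seconds since the start of capture (assumed unit)
    objects: FrozenSet[str]   # IDs of the tags currently in the field of view


class ObjectHistories:
    """Hypothetical store in which each object accumulates its observations."""

    def __init__(self) -> None:
        self._by_object: Dict[str, List[Observation]] = defaultdict(list)

    def record(self, obs: Observation) -> None:
        # Every object in view accumulates this observation in its history.
        for obj in obs.objects:
            self._by_object[obj].append(obs)

    def history(self, obj: str, user: str) -> List[Observation]:
        # Personalized history: only what this user observed of the object.
        return [o for o in self._by_object.get(obj, []) if o.user == user]

    def context(self, obj: str, user: str) -> Set[str]:
        # Contextualization: objects this user saw together with `obj`.
        seen: Set[str] = set()
        for o in self.history(obj, user):
            seen |= o.objects
        seen.discard(obj)
        return seen

    def new_contexts(self, obj: str, user: str) -> Set[str]:
        # Re-contextualization: objects that other users associated with
        # `obj` but that this user never saw together with it.
        others: Set[str] = set()
        for o in self._by_object.get(obj, []):
            if o.user != user:
                others |= o.objects
        others.discard(obj)
        return others - self.context(obj, user)


histories = ObjectHistories()
histories.record(Observation("user1", 1.0, frozenset({"A", "B"})))
histories.record(Observation("user2", 5.0, frozenset({"A", "C"})))

own_context = histories.context("A", "user1")         # what user1 saw with A
suggested = histories.new_contexts("A", "user1")      # what only others saw with A
```

Under these assumptions, user 1 is reminded of B when replaying A, while C is offered as a new context contributed by user 2.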
Part of this idea is that not only do individual objects have experiential significance, but also groups of objects carry some semantic meaning. A group of objects can be considered to be a “configuration” or a “situation”, in other words, a higher-level semantic unit of the experience. An object’s history should be associated with the situations in which that object was seen. For example, the fact that objects A and B are observed together by a user means that object A is related, through situation AB, with object B. The situation AB is the higher-level grouping mechanism which relates objects to one another through their common observational history.

If we accept this, then, just as we can speak of the history of an object, we can also speak of the history of a situation. Just as we can relate objects with one another, we can relate situations with one another, or objects with situations. These situations can furthermore be grouped into even larger situations. This leads to a sort of hierarchy or continuous incremental experience structure, and suggests the possibility of developing an algebra for describing and reasoning about observations, situations, and higher-level groups.

As an example of the kind of questions which such an observational algebra might answer, consider the case of three objects A, B, and C. User 1 observes objects A and B together, forming situation AB. User 2 observes objects B and C together, forming situation BC. Then, in the storytelling environment, user 1 and user 2 collaboratively talk about objects A and C together, forming new situation AC. What then is the relationship among situations A, AB, B, BC, C, and AC? Future work will explore this idea further.

The Value of Context

By capturing the context of objects as observed, we provide for the later possibility to understand the original context of objects or object groups. We aim to answer questions of the following forms: In what situations was a particular object involved?
In what situations was a particular group of objects involved? To what degree are other situations related to the currently chosen object or object group? By capturing context and defining a comparison metric, these types of questions can be answered. The value of answering these questions is that it provides an intuitive, object-centric way of understanding, organizing, and telling stories about experience. It also provides a method of showing both strong and weak relations to other parts of the experience. The MyLifeBits project [1] also emphasizes the importance of “linking” to allow the user to understand the context of captured bits of experience.

As an example, one can imagine a person who works as a home decorator, who uses a variety of furnishings meant to decorate the interior of a home. The decorator has built up an experience of creating several different configurations of objects in different situations. When trying to create a new decoration, it can be useful to try to group objects together (e.g. potted plant, shelf, and lamp), then see what previous situations, from the personalized experience corpus, have used this object group before. When illustrating to a client the decoration possibilities, the decorator, as a storyteller, could select candidate object groups and tell a story (by playing back related, captured video clips in a 3D stage space) about how those object groups have been used in previous designs. This also points out the value of using other persons’ experience corpora, as it can provide new and different perspectives on how those objects might be combined.

The preceding example raises a subtle point not yet addressed, namely, that object types, and not just objects themselves, can also be important in classifying experience. In the above example, the decorator may be less interested in the history of one particular furnishing (e.g.
one particular potted plant), but rather may be more interested in past related experiences using some types of furnishings (e.g. past decoration designs using potted plants in general). On the other hand, there are also situations where we are indeed interested in the history of one particular instance of an object. For instance, if a home decorator is involved with regularly re-decorating several homes, then within one specific home it can be useful to understand the specific object history of a specific furnishing.

A STORYTELLING SYSTEM

Based on the previous object-centric paradigm for experience, this section presents a proposed structure for a storytelling system using visualized object histories. Core technology for this system is currently being implemented. The structure consists of five phases, each of which will be described separately. The five phases are capture, segmentation, clustering, primitive extraction, and storytelling.

Figure 3: Role of the experience map in the storytelling process. Objects are grouped into situations based on subjective observation. The experience map shows the situations and allows planning and telling a story about the objects in the situations.

Capture

In the capture phase, a user captures an experience by wearing a head-mounted camera which records a video stream as well as continuous gazing data, in other words, which tagged objects were seen at any particular point in time during the experience. A tagged object can be either an inanimate artifact or another person; the only requirement is that the object has a tag on it so that it can be recognized and recorded by the capture system.

Segmentation

The goal of the segmentation phase is to break up the captured video data into chunks or segments which can then be compared (the comparison measure is discussed in the next section) and clustered. Two main approaches have been developed for the segmentation. The first approach is simply to divide the captured video into equally-sized segments. The second approach is to define an observation vector for each instant¹ of the captured video, and to cause formation of a new video segment whenever the observation vector “significantly” changes. The observation vector for an instant is the set of all objects observed by the user during that instant. Therefore, this second approach reasons that whenever the set of observed objects “significantly” changes, a new situation has in some sense occurred.

The reason that two approaches have been considered is that they tend to yield units (video clips) of different lengths. With the first approach, all resulting video clips are of equal, and short, length. With the second approach, resulting video clips tend to be longer; a new clip starts only when the situation changes. The first approach is more likely to uncover "hidden" patterns in the data because it imposes little structure on the data; the second approach introduces some sort of algorithmic bias, due to the more complicated decision on when a segment ends, but the hope is that this will yield longer, more semantically meaningful segments. The reason that longer video segments may be desirable is that they may serve as a better basis for extracting useful primitives which can be used in storytelling.

¹ Technically, due to sampling issues, the observation vector cannot be measured instantaneously, but is instead aggregated over a small time-slice with an epsilon duration.

Clustering

Given the segments from the previous segmentation step, the clustering phase compares the segments and clusters them together. The idea is that groups of similar segments form situations. Again, this is based on the object-centric organizing principle discussed earlier.

For each segment under consideration, we first generate the observation vector over the entire time interval of that segment. An observation vector for a particular interval of time is a binary-valued vector representing, for all objects under consideration, whether that object was seen or not. Then, we use a clustering algorithm to compare the similarity of segments by comparing their observation vectors. To compare similarity of observation vectors, we have chosen to use the Tanimoto similarity measure [3, p. 16-17],

$$S_T(a, b) = \frac{a \cdot b}{(a \cdot a) + (b \cdot b) - (a \cdot b)},$$

with the $\cdot$ operator representing the inner (dot) product. For binary observation vectors, this is essentially the ratio of the number of objects common to both vectors to the number of distinct objects present in either vector.

Clusters represent situations in the original captured experience; they are groups of video segments, each involving a similar set of objects. By displaying the clusters on a 2D map, and by mapping similar clusters close to each other, we can create a “map” of the experience which can serve as a structural guide and memory aid during storytelling. Figure 3 shows the conceptual role of the experience map, and Figure 4 shows a sample interactive experience map created using magnetic attractive/repulsive forces.

Figure 4: A sample experience map illustrating clusters forming situations.

Primitive Extraction

Given the clusters from the previous clustering step, the primitive extraction phase aims to extract reusable video primitives from the situation clusters. By “reusable” we refer to the “loosening of context” discussed earlier. We aim to extract temporal and spatial subsets of the video which can be used in a variety of contexts to tell many stories relating to the original captured experience. The output of this phase should be a pool of video clips which represent object histories. This phase requires human intervention to decide which video clips from the situation clusters are representative of the experience and which have communicative value.

Storytelling

In this final phase, a storyteller uses the video primitives extracted from the previous phase to tell a story to others. Within a virtual 3D stage space, video billboards representing the objects are placed in the environment and can be moved and activated by storytellers or participants in the space. Video billboards of objects can be activated in order to play back their object histories which were extracted in the primitive extraction phase. The current object configuration can be measured in the 3D stage space by generating an observation vector, just as is done in physical space during experience capture; this virtual observation vector defines a current “storytelling situation” which can be mapped onto the experience map to illustrate the storyteller’s “location” in conceptual “story space.”

CONCLUSION

Given a ubiquitous sensor room capable of capturing video, audio, and gazing data, this paper described the use of an object-centric approach to organizing and communicating experience by visualizing personalized object histories. A storytelling system structure based on this object-centric idea was proposed. Core technology for this storytelling system is being developed, and work continues on gaining insight into a reasoning framework or algebra for observations.

ACKNOWLEDGEMENTS

This research was supported by the Telecommunications Advancement Organization of Japan. Highly valuable contributions to this work were made by Sadanori Itoh, Atsushi Nakahara, and Masahi Takahashi.

REFERENCES

1. J. Gemmell, G. Bell, R. Lueder, S. Drucker, and C. Wong. Mylifebits: Fulfilling the memex vision, 2002.
2. J. Glos and J. Cassell. Rosebud: Technological toys for storytelling. In Proceedings of CHI 1997 Extended Abstracts, pages 359–360, 1997.
3. Teuvo Kohonen. Self-Organizing Maps. Springer-Verlag Berlin Heidelberg, 1995.
4. Tetsuya Matsuguchi, Yasuyuki Sumi, and Kenji Mase. Deciphering interactions from spatio-temporal data. IPSJ SIGNotes Human Interface, (102), 2002.
5. Kimiko Ryokai and Justine Cassell. Storymat: A play space with narrative memories. In Intelligent User Interfaces, page 201, 1999.
6. Yasuyuki Sumi. Collaborative capturing of interactions by wearable/ubiquitous sensors. The 2nd CREST Workshop on Advanced Computing and Communicating Techniques for Wearable Information Playing, Panel "Killer Applications to Implement Wearable Information Playing Stations Used in Daily Life", Nara, Japan, May 2003.
7. Steve Whittaker. Things to talk about when talking about things. Human-Computer Interaction, 18:149–170, 2003.

Storing and Replaying Experiences in Mixed Environments using Hypermedia

Nuno Correia1 (nmc@di.fct.unl.pt), Luis Alves1 (lma@di.fct.unl.pt), Jorge Santiago1 (jms@di.fct.unl.pt), Luis Romero1,2 (lmcr@di.fct.unl.pt)
1 Interactive Multimedia Group, DI and CITI, New University of Lisbon, Portugal
2 School of Technology and Management, Viana do Castelo Polytechnic Institute, Portugal

ABSTRACT

This paper describes a model and tools to store and replay user experiences in mixed environments. The experience is stored as a set of hypermedia nodes and links, with the information that was displayed along with the video of the real world that was navigated. It uses a generic hypermedia model implemented as software components developed to handle mixed reality environments. The mechanisms for storing and replaying the experience are part of this model. The paper presents the goals of the system, the underlying hypermedia model, and the preliminary tools that we are developing.

Keywords

Store/replay user experience, mixed reality, hypermedia, video.

INTRODUCTION

Storing photos, videos, and objects that help to remember past experiences is an activity that almost everyone has done at some point in their lives. Sometimes these materials are also augmented with annotations that help to remember or add personal comments about the situation and events that took place. The content that is stored is mostly used to remember the events but also to compose them in new ways and create new content. This activity is becoming increasingly dependent on technological support, and multiple media can currently be used. In a mixed reality environment, where users are involved in live activities, the replay and arrangement of such experiences is definitely a requirement. In mixed reality, people can participate in gaming or exploration activities either alone or with other people, and this is a perfect setting for generating interesting activities that people want to remember at a later time.

Previous work in this area, in mixed reality, includes [7]. Other related systems, which store user annotations or repurpose captured materials, include [2, 5, 6]. The Ambient Wood project described in [7] introduces an augmented physical space to enable learning experiences by children who take readings of moisture and light levels. The activities of the children who participate are recorded in log files. These log files are later replayed to enable further reflection on the experiences they had in the physical augmented outdoor environment. In [6] the authors present a system for capturing public experiences and personalizing the results. The system accepts the different streams that are generated, including speaker slides and notes, student notes, and visited Web pages, and it stores all this information along with the timestamps that enable synchronization. The playback interface has features for rapid browsing that make it possible to locate a point of interest in the streams that were captured.
VideoPaper [2] is a system for multimedia browsing, analysis, and replay. It has been used for several applications including meetings, news, oral stories and personal recordings. The system captures audio and video streams, and key frames if slides are used. VideoPaper uses this data to produce a paper document that includes barcodes that can give access to digital information. This information can be accessed on a PDA or on a PC connected to a media server.

The SHAPE project [3] had the goal of designing novel technologies for interpersonal communication in public places, such as museums and galleries. The users can learn about antique artifacts and their history. Although the main focus of the system is not on replay of experiences, some of the features are related. The users can search for objects in a physical setting and their positions are tracked. Later they can continue their exploration and obtain more information about the objects that they were searching for in a mixed reality setting, with projection screens.

This area of research, storing the memory of past experiences, was also identified as one of the “Grand Challenges for Computing Research” [1]. The workgroup that produced the report identified several problems related to data storage and analysis, interaction with sensors, and human-computer interaction that will shape future research in this area and that are key questions for the work that we describe here.

This paper presents an approach for storing the user experience using hypermedia structures, much in the way the history mechanism (that allows accessing previously visited nodes) is used in Web browsers. In this case the activities take place in the real world, and may involve accessing digital documents and entering and navigating in virtual worlds. The paper presents the scenario of usage, the underlying hypermedia model, the specific mechanism for storing/replaying, and the tools that we are developing to support these mechanisms.
SCENARIO

The scenario of usage is a physical space, e.g. a museum, an art gallery, or even an outdoor space, where there are several interest points, e.g. paintings or objects, detected by the system. The user carries a portable wearable system that is able to capture the video of the real world scene, detect objects that are close or within the field of view, access a database using a wireless network, and display additional information over the real video. If the option for storing the user experience is on, the video is captured along with the information about interest points as the user moves around the space. The information presented at each interest point, and thus stored for later replay, can be video, audio, text, images, or virtual worlds. The user experience involves the visualization of the physical space, the augmented information, and the navigated virtual worlds. All this data is stored as a hypermedia network using the mechanisms described in the next sections.

HYPERMEDIA MODEL

Storing and displaying information is supported by a hypermedia model defined by a set of reusable components for application programming. The model includes the following types of components:
• Atomic: It represents the basic data types, e.g., text and image.
• Composite: It is a container for other components, including Composites, and it is used to structure an interface hierarchically.
• Link: It establishes relations among components.

Every component includes a list of Anchors and a Presentation Specification. Anchors allow referencing part of a component and are used in specifiers, triplets consisting of anchor, component and direction, which are used in Links to establish relations between the different components of a hypermedia graph. The Presentation Specification describes the way the data is presented in an augmented interface. The interface is structured with Composite objects that establish a hierarchy of visual blocks. Interfaces are presented (and removed) according to the sequence of events, as described next.

Events

Anything that happens and changes the information that is presented is considered an event. There are three main types of events, as follows:
• Location of user in a space.
• Recognition of an interest point, identified by an optical marker or an RFID tag.
• User navigation or choice.

The position of a user in the space can also define an interest point. If the space has several subspaces (rooms, floors), moving from one to another will generate an event. Whenever an event is generated, new information is displayed, and the interface changes. A location event does not necessarily generate a change in the information and it can also occur in virtual spaces.

In the physical space where the user is, there are interest points that are detected by the system. When one of these points is detected, new information is displayed on the mobile device of the user. When this point of interest is no longer detected, the information ceases to be available unless this was a manual choice by the user.

An information block that is displayed as a result of an event can be browsed by the user, thus originating a change in the content. Each navigation action made by the user creates a new event.

APPLICATIONS

We are developing several applications to test the hypermedia model in context aware augmented environments. The two applications where the development is most advanced are a museum/gallery information assistant and a game that takes place in the gallery environment. The gallery experimental space consists of a room with subdivisions to create a navigational need. The physical entities consist of paintings, and each painting is positioned in a different part of the room.
Virtual 3D models related to these paintings have been created and used to enrich the information setting, augment the user interface and allow navigation within those worlds in search of new experiences and knowledge. The user set up consists of a portable PC, with a wireless LAN card, and a camera to capture the real world video. There are two alternative user set-ups: in the first the visualization and interaction is done directly on the PC; and the other uses a Head Mounted Display, and a 2-3-button device for interaction purposes. We are currently using the Cy-Visor DH-4400VP video see through display. The main recognition process is accomplished through the camera device. There are markers associated with each painting that are optically recognized through an augment reality toolkit (ARToolkit) developed at the University of Washington. The system uses this recognition process to know the user position and orientation, although the components of the hypermedia model can accept input from different devices for location purposes. Once objects are recognized, media data is added to the real world video capture, by accessing the remote hypermedia graph. When manipulating 3D data, such as the worlds that represent the paintings, a 3D behavior toolkit is used to superimpose the models over the real world video and navigate in them. There is one ARToolkit marker located near each painting. When this marker is recognized the system presents information about the painting and an iconic simplified 3D representation. If the user selects this model it will enter a complex and detailed virtual world representing the painting where navigation is possible and the game described next takes place. Adding to the gallery information setting, a mystery game was also developed. The story consists in solving a robbery that took place in the gallery. The user has to gather clues and interact with virtual characters to find the stolen item. 
To do so, the player has to move around the physical and virtual spaces. The game features several objects to be accessed or navigated during playtime, namely worlds, characters and clues.

This method of specifying history through a sequence of scenes opens up obvious possibilities for arranging it in a different order or introducing new media elements. This is a generic mechanism for repurposing these types of materials and building new applications (e.g., storytelling).

[Figure 2: History Structure. Labels: Time, Scene Start, Story (Action, Duration), Entity, Content, Video.]

Storage Requirements
It is necessary to store different sets of data in order to represent and later replay the user experience:
• Video of the real world scenes
• Events
• Information that is displayed
The main storage requirement is related to the video capture. The video has to be stored continuously while the events are occurring. The only exception happens when the user navigates in virtual worlds. The real world video is stored in streams and is referenced by Entity components, through Anchors and Content components, for each event that occurs. The streams are interrupted whenever the interface is a virtual world where the user navigates.

[Figure 1: Application Architecture]

HISTORY STRUCTURE
In a context aware application, the experience can be divided into several scenes, each of which is triggered by a particular action. Each scene has two main elements: the content of the interface, and the action that triggered the interface. Associated with the action is also its life period. Each scene is presented until another action is processed that leads to another interface content. For instance, in an augmented reality environment, the real video is needed to replay the experience, as well as the augmented information. The action is the command (or event) that caused the augmented content to be displayed. The experience history is built up of several scenes.
The hypermedia system models scenes with the Story and Entity components. Each scene is a Story link that points to an Entity component. The Story link contains the action and duration attributes. The Entity component is associated to a set of links that specify the data elements needed to replay the scene. The result of navigating in the system, a history instance, is a linked list of Story/Entity component pairs. The Entity components are linked together by Story components, forming a path of links and nodes. Figure 2 illustrates this structure. The Story components in the scene list contain the events and the associated information. For each event a new Story component is created with the event parameters. Simultaneously an Entity component is created with the necessary Content connections to reproduce the state of the interface after processing the event. The state of the interface is a set of Content connections with the corresponding dynamic behaviors, for each information item displayed in the interface at a given instant. The necessary data for later replay are copied from the original components and referenced in Content links through specifiers. The behavior of the interface is reproduced with the Presentation Specification of the Content links. REPLAYING THE EXPERIENCE In order to replay the experience we are developing a set of applications. These applications assume that as a result of a previous navigation session a hypermedia graph, as described above, was produced and added to the main graph, that contains the overall information (Figure 3). This hypermedia graph describes the experience, including video, events and user interfaces. It includes all the structural and timing information needed to provide a view of a past experience in an augmented environment. In order to replay the experience we are considering different levels and tools, described in the next subsections. 
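The Story/Entity history described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class and field names (`Content`, `Entity`, `Story`, `History`, `record`, `scenes`), the string identifiers, and the use of seconds for the duration are all assumptions made for illustration. It only shows the shape of the structure: each event appends a Story link carrying the action and duration, and that link points to an Entity holding the Content links needed to reproduce the interface.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Content:
    """Content link: references a copy of one displayed item (hypothetical fields)."""
    media_type: str          # e.g. "video", "text", "image"
    reference: str           # hypothetical identifier of the stored data
    presentation: str = ""   # stands in for the Presentation Specification


@dataclass
class Entity:
    """Interface state after an event: the Content links shown at that instant."""
    contents: List[Content] = field(default_factory=list)


@dataclass
class Story:
    """Story link: holds the action and duration and points to an Entity."""
    action: str
    duration: float          # assumed to be seconds
    entity: Entity
    next: Optional["Story"] = None


class History:
    """A history instance: a linked list of Story/Entity pairs."""

    def __init__(self) -> None:
        self.head: Optional[Story] = None
        self.tail: Optional[Story] = None

    def record(self, action: str, duration: float, contents: List[Content]) -> Story:
        """Append a new scene for an event that changed the interface."""
        story = Story(action, duration, Entity(list(contents)))
        if self.tail is None:
            self.head = story
        else:
            self.tail.next = story
        self.tail = story
        return story

    def scenes(self) -> Iterator[Story]:
        """Walk the scenes in the order in which they were recorded."""
        node = self.head
        while node is not None:
            yield node
            node = node.next


# Example session: a location event followed by a marker recognition.
history = History()
history.record("location:room-1", 12.0, [Content("video", "stream-0")])
history.record(
    "marker:painting-3",
    30.5,
    [Content("video", "stream-0"), Content("text", "painting-3-info")],
)
for scene in history.scenes():
    print(scene.action, scene.duration, [c.media_type for c in scene.entity.contents])
```

Because the scenes form a simple linked list, reordering them or inserting new media elements amounts to relinking Story nodes, which is what makes the structure convenient for repurposing.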
video, augmented information, and navigation in virtual worlds into a coherent narrative structure. PC Player Authoring Environment Player/editor of the experiences stored as a hypermedia history, for later browsing. This is a tool that enables to browse the stored materials at a later stage, in a setting that can be different from the one where the experience took place. The typical usage is on someone’s personal computer. This is the first tool that we are implementing and it is essentially a video player that displays the video that was captured during the navigation in the real world. Superimposed to this video, the augmented materials that were presented in the original navigation are presented. Besides the traditional video player control buttons (play, stop, pause, resume) it includes buttons for Next/Previous, which allows going back and forth in the history list. This navigation mode is exactly the same as going to the next/previous interest point that was defined in the physical space. Additionally, the tool allows adding annotations to a given point in the video stream. Besides these applications we are building an authoring environment for mixed reality applications that will also be used to browse and edit hypermedia networks that resulted from previous experiences. This allows to add additional materials at a later stage, for example, to provide more insights about a given interest point. This corresponds to add more components to the hypermedia graph or remove them. The authoring environment includes a graph browser and space representations (2D and 3D). It allows attaching Entity components to physical spaces and additional content. Current User History Hypermedia Graph User 1 Player User 2 User n Movie Mixed Reality Figure 3: Information storage and repurpose scheme Mixed Reality Player This player allows to access stored experiences when the user is in the physical setting. 
When the user reaches an interest point it can follow the links for further information or it can access previous content that she or others have browsed at that time. This mechanism can be used in gaming settings or it can also be used to leave personal information attached to physical spaces, as memory for future visitors. Movie The navigation in the real world combined with access to virtual worlds can be viewed as a movie. As such, we intend to explore this option by defining a set of montage and editing rules that can be applied to the overall hypermedia network in order to generate a movie. This movie will integrate the different elements: the original CONCLUSIONS AND FUTURE WORK This paper presents an approach for replaying experiences in mixed reality using hypermedia mechanisms. The main advantage of using hypermedia as support for this type of activities comes from the fact that hypermedia mechanisms provide a powerful and well-tested way to structure information and provide support for navigation. When extending the existing hypermedia mechanisms to the real world many concepts can be used including navigation aids, annotations or bookmarks, and path/history mechanisms. The history mechanism, common in most hypermedia systems, namely the Web browsers, is the main concept that supports the work that we are doing. The history, as list of visited nodes, provides a simple and flexible mechanism for structuring information captured from the real world along with virtual elements. The applications that we are building explore different ways to replay the events, ranging from a player to be used in a normal PC to exploration in the place where past events took place. Additionally, we are exploring the possibility of generating a movie out of the raw materials that were captured and stored. 
The current status of the system includes an implementation of the hypermedia model, testing applications (the information and gaming environment mentioned above), and preliminary tools for replay (the player for later browsing, in a PC setting). Further work includes the tools for replaying the experience where it took place. Editing the result of a session is another direction that we want to explore in the context of the authoring environment that we are building. Editing can be as simple as adding or removing materials, but it can also include transforming and repurposing the materials using storytelling and cinematographic techniques.

ACKNOWLEDGEMENTS
We acknowledge the financial support given by “PRODEP III, Medida 5, Acção 5.3”, a FSE education program.

REFERENCES
1. Fitzgibbon, A., Reiter, E. “Memories for Life” – Managing Information Over a Human Lifetime. Included in Grand Challenges for Computing Research, Sponsored by the UK Computing Research Committee, with support from EPSRC and NeSC, http://www.nesc.ac.uk/esi/events/Grand_Challenges/
2. Graham, J., Erol B., Hull, J., and Lee, D. The Video Paper Multimedia Playback System. ACM Multimedia 2003, Berkeley, USA, November 2003
3. Hall, T. et al. The Visitor as Virtual Archaeologist: Explorations in Mixed Reality Technology to Enhance Educational and Social Interaction in the Museum. Proceedings of the 2001 conference on Virtual reality, archeology, and cultural heritage, Glyfada, Greece, 2001
4. Pea, R., Mills M., Rosen, J., Dauber, K., Effelsberg, W., Hoffert, E. The Diver Project: Interactive Digital Video Repurposing. IEEE Multimedia Jan/Mar 2004
5. Romero, L., and Correia, N. HyperReal: A Hypermedia Model for Mixed Reality. ACM Hypertext’03, Nottingham, UK, August 2003
6. Truong, K., Abowd, G., and Brotherton, J. Personalizing the Capture of Public Experiences. UIST’99, Asheville, USA, November 1999
7. Weal, M., Michaelides, D., Thompson, M., and De Roure, D. The Ambient Wood Journals – Replaying the Experience.
ACM Hypertext’03, Nottingham, UK, August 2003 Storing, indexing and retrieving my autobiography Alberto Frigo Innovative Design, Chalmers University of Technology 412 96 Gothenburg, Sweden it2frx@ituniv.se ABSTRACT This paper describes an ongoing experiment consisting of photographing each time my right hand uses an object in order to create my autobiography for self-reflection and enforcing my identity. This experiment has now been carried out for six months. The daily sequences of photos are linked together on a portable computer based on the typology of the object represented. With this structure I can review the database and retrieve a record of my past activities both chronologically and in an associative manner. The portable database is also used to support communication with persons in my proximity. Finally I consider a scenario where several users carrying such database could interact with one another. Keywords Autobiography, photography of object engagement, object typologies, portable database. INTRODUCTION Since the 24th of September 2003 I have been photographing and cataloguing each time my right hand has used an object. The images are chronologically collected in a portable database [1] I constantly carry with me. Here they are linked to one another based on the object represented. The visualization of the database attempts to codify the whole of my life patching together and associating every single event. Each of my life-events finds a representative symbol in the images of the objects used while accomplishing it. These symbols are meant to be a direct stimulus for me to actively remember my past. As I am reviewing this paper (24th of March 2004), 9536 activities from 181 days have been indexed in 124 categories stored on my portable database. The experience of writing this paper has been stored by photographing my hand using my computer mouse (see the first image of the third row in fig. 2). 
Later I will retrieve today’s images to quickly index them with a binary code of eight digits corresponding to a matrix of icons representing the objects (fig. 1).

[fig. 1: matrices of icons, each icon labelled with a four-digit binary code.]

By combining an icon from the matrix on the left with an icon from the matrix on the right I can assign to every typology of objects eight binary digits which are different for each category and easy to remember. On the subway on my way to work I am likely to extract my database from my pocket and browse it, to recollect my history and myself. I might also retrieve it in front of a colleague of mine asking me: “How are you?” and my answer: “Good!” would be accompanied by a rapid slide-show of the recent photographed objects I have engaged with. The photo of my hand engaging with the database on the subway would appear as well (see the first image of the first row in fig. 3).

BACKGROUND
Throughout history humans have been using images as cognitive tools to assist memory. For this purpose images have been both mentally internalized and externalized as for instance in an alphabet (fig. 4).

fig. 2 The images are organized per typology of object starting left to right with the most recent photographed engagement.

fig. 3 The daily sequence of objects used on the 12th of February 2004, see http://www.id.gu.se/~alberto/12.02.04.html

Besides, as life events within an inorganic context can repeat themselves identically, the same repetition of a typology of object can establish a link between situations distant in time but where I had to follow an identical procedure. An example of this would be that of myself brushing my teeth after every meal. Those identical situations might frequently look redundant though the photographic representation of this activity can reveal some unusual situations.
An example could be that of a strangely shaped toothbrush that immediately reminds me that I used it at a friend’s place when a snow storm prohibited me to drive home so I was invited to stay over and I was given a brand new toothbrush. fig. 4 A 15th century example of an alphabet where the representation of objects were used by a priest to remember his sermons. In both cases, whether the memory image is internal or external, a sequential order to move from one image to another is needed. In the practice of the Ars Memorativa this order was given by mentally dispersing the images within a familiar architecture, to then mentally move from one room to another in a predetermined way [2]. MOTIVATIONS My method of photographing each object I engage with was created as a medium to recollect myself and my personal history, which I see as very fragmented, interrupted by a technology that allows different realities, different selves that I am not able to express as a whole. In today’s artificially mediated reality, a continuous narration of my life is not possible, too many inorganic interruptions have occurred. The inorganic interruptions can be symbolized by the objects I voluntary or involuntary, consciously or unconsciously engage with. The objects are the artificial tools that allow me to access different contexts. For example a magnetic card allows me to enter the gym where I will exercise and become a macho, the mobile phone allows me to contact my relatives in my native country and suddenly re-become part of them, a pen allows me to sign a contract that allows me to get an apartment in an upper class neighborhood and so changing my social status. The objects I have been engaging with throughout my life symbolize those drastic, therefore inorganic changes. The objects can then be seen as joining together life events that most likely have no organic connections between one another. 
If my life and my effort of existing is worth to be remembered and communicated, a visual inventory of those objects is, in my opinion, its most accurate and immediate representation. The objects I photograph, while used, represent single specific activities that from a more general perspective can visualize how, throughout my life, my intentions, my desires, my sorrows have mutated. The objects become my emblems, the code through which the whole of me can be reconstructed, interpreted. POTENTIAL The utility of this mnemonic mechanism can be found in the way it provides a language of interpretation of a person’s life. Through genetic code we are able to trace our organic evolution yet within an inorganic context, genetic code is not sufficient to trace what I would call our inorganic evolution, our artificially mediated way of being. On a macro level the sequences of objects a hypothetic person might have used, could be examined based on their frequency. On a micro level each object represents an objective landmark to the psychological, physical state of this person around the moment in which the object was used. THE DESIGN For the physical design of the database the concept was to make it self-contained, portable and self-sustainable. The result is a mixture of few commercially available electronic accessories that consist of: • A low cost digital camera, and its battery and charger. • A handheld pc and its charger. • A memory card. protective, practical to use and easy to wear both in the labeling and showing mode. fig. 5 The camera and the pocket PC. The memory card allows me to transfer the images from the camera to the handheld pc where an application developed within the project will link them to both the day sequence (fig. 3) and the object sequence (fig. 2) via the eight digits I input (fig. 1). fig. 7 The bag. CONTEXTUALISATION My method should be referred as a strictly visual attempt to build a memory-aid. 
Other research has been carried out on the same agenda, also exploring wearable and portable technologies. However this research has primarily focused either on text based approaches, as in the case of Augmented Reality where the user annotates conversations [3], or on a continuous video and/or audio recording of reality, as in the case of wearable surveillance devices [4]. My approach differs from these by being selective in time. This selection is neither completely subjective nor completely objective. The usage of an object tells me exactly when it is time to capture, yet it is my decision to use it.

fig. 6 The figure illustrates how the handheld PC buttons are arranged both to catalogue and show images.

The central button of the handheld pc (fig. 6) allows me to slide show the images chronologically when going left or right (respectively more remote and more recent). When showing the image of an object, by pressing up or down I can slide show the images of the same type of object captured respectively after or before. This last method associates those life-events symbolized by the same type of object engagement. The buttons 0 and 1 allow me to input the eight digits corresponding to a category (fig. 1) and by pressing + in the central button, I can retrieve the most recent image of that category. Images are labeled in the same way. By pressing the white button I switch to label mode. In this mode the application picks the first image stored in the memory card of the portable camera and shows it on screen. After I input the 0 and/or 1 digits and press +, the application updates the database and proposes a new image on screen until there are no more. With the black button I can delete the eight-digit label just entered or, in show mode, exit the program. The whole configuration is fitted in a pouch around my waist (fig. 7).
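The labelling and browsing scheme described above can be sketched in a few lines of Python. This is an illustrative assumption, not the application developed within the project: the names `category_code`, `Photo`, `PhotoDatabase`, `label`, `most_recent`, `chronological` and `same_object`, the file names, and the particular icon codes are all hypothetical. The sketch joins two four-digit icon codes into one eight-digit category code and keeps two orderings of the same images, the day sequence and the object sequence.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


def category_code(left_icon: str, right_icon: str) -> str:
    """Join two four-digit binary icon codes into one eight-digit category code."""
    for part in (left_icon, right_icon):
        if len(part) != 4 or set(part) - {"0", "1"}:
            raise ValueError(f"expected four binary digits, got {part!r}")
    return left_icon + right_icon


@dataclass
class Photo:
    filename: str
    code: str  # eight-digit category label


class PhotoDatabase:
    """Images kept in chronological order, plus an index per object category."""

    def __init__(self) -> None:
        self.photos: List[Photo] = []             # day sequence
        self.by_code: Dict[str, List[int]] = {}   # object sequence per category

    def label(self, filename: str, code: str) -> int:
        """Label mode: store the next image under its category code."""
        index = len(self.photos)
        self.photos.append(Photo(filename, code))
        self.by_code.setdefault(code, []).append(index)
        return index

    def most_recent(self, code: str) -> Optional[int]:
        """Entering a code and pressing +: latest image of that category."""
        indices = self.by_code.get(code)
        return indices[-1] if indices else None

    def chronological(self, index: int, step: int) -> Optional[int]:
        """Left/right: step -1 is more remote, +1 is more recent."""
        target = index + step
        return target if 0 <= target < len(self.photos) else None

    def same_object(self, index: int, step: int) -> Optional[int]:
        """Up/down: step +1 is the same object type captured after, -1 before."""
        indices = self.by_code[self.photos[index].code]
        position = indices.index(index) + step
        return indices[position] if 0 <= position < len(indices) else None


# Example with made-up icon codes and file names.
db = PhotoDatabase()
mouse = category_code("0001", "0011")
toothbrush = category_code("0010", "0110")
db.label("img001.jpg", mouse)
db.label("img002.jpg", toothbrush)
db.label("img003.jpg", mouse)
print(db.most_recent(mouse), db.same_object(2, -1), db.chronological(1, +1))
```

Keeping a per-category list of indices alongside the chronological list is one simple way to support both the chronological and the associative review of the images without duplicating them.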
The pouch has been adapted so to be My approach is not a substitute for my memory but it assists it by showing a sequential order of activities that triggers an internal recollection of the situation around them. On the same level a related portable system was designed at Xerox Parc [5] a decade ago. This system collected graphic icons based on the user activities within an office space. Although the activities that could be tracked were limited and the graphic icons repetitive, this project was an important predecessor of my work. CONCEPT IMPLEMENTATION In a scenario where a whole crowd of people have their own portable database containing the objects representing their activities, I can imagine persons encountering each other and quickly getting an overview of their objects’ similarities or differences and perhaps distributing themselves according to those. If I for instance dislike smokers, I would try to avoid persons whose database contains many lighters and instead approach those that have swim goggles (I like to swim). Still I would be really curious about the persons around me, diminishing the sense of alienation typical of metropolitans. The sum of each of our autobiographical databases could become the medium of a global history, an authentic record where it would be possible to determine our present consequences and perhaps even predict future ones. It is my opinion that this distributed method, where each individual in society would have to contribute to a global history, is both a more concrete and less overwhelming possibility than a surveillance system distributed in the environment. CONCLUSION To conclude I would like to stress my position that the design of a memory-aid and sharing of experiences should involve a high degree of participation from the user without aiming to simulate his or her life. On the contrary this design should be of existential and intellectual stimulus. 
Existential stimulus in the sense that the user should look for exciting experiences worth remembering and narrating, intellectual stimulus in the sense that he or she, while retrieving the recorded experiences, should contribute with his or her own memory.

fig. 8 A visualization of possible interactions between persons carrying the database and encountering one another.

REFERENCES
1. To access my current project please visit: http://www.id.gu.se/~alberto/
2. Frances A. Yates, The art of memory, The University of Chicago Press, Chicago 1966.
3. Thad Starner, Steve Mann, Bradley Rhodes, Jeffrey Levine, Jennifer Healey, Dana Kirsch, Rosalind W. Picard, Alex Pentland, Augmented Reality Through Wearable Computing, Presence, Vol. 6, No. 4, August 1997, pp. 386-398.
4. Steve Mann, Wearable Computing: A First Step Toward Personal Imaging, Computer, February 1997, pp. 25-31.
5. Mik Lamming, Mike Flynn, "Forget-me-not" Intimate Computing in Support of Human Memory, FRIEND21 Symposium, Next Generation Human Interfaces, Tokyo Japan, 1994.

Sharing Experience and Knowledge with Wearable Computers
Marcus Nilsson, Mikael Drugge, Peter Parnes
Division of Media Technology, Department of Computer Science & Electrical Engineering, Luleå University of Technology, SE-971 87 Luleå, Sweden
{marcus.nilsson, mikael.drugge, peter.parnes}@ltu.se

ABSTRACT
Wearable computers have mostly been studied when used in isolation. But a wearable computer with an Internet connection is a good tool for communication and for sharing knowledge and experience with other people. The unobtrusiveness of this type of equipment makes it easy to communicate in most types of locations and contexts. The wearable computer makes it easy to be a mediator of other people's knowledge and to become a knowledgeable user. This paper describes the experience gained from testing the wearable computer as a communication tool and being the knowledgeable user at different fairs.

Keywords
Group communication, wearable computer.
INTRODUCTION
Wearable computers can today be built from off-the-shelf equipment and are becoming more commonly used in areas such as construction and health care. Researchers in the wearable computer area believe that the wearable computer will become equipment for everyone, aiding the user all day. This aid is in areas where computers are better suited than humans, for example memory tasks. Wearable computer research has been focusing on the usage of the wearable computer in isolation [5]. The Media Technology group at Luleå University of Technology believes that a major use of the wearable computer will be the connections it makes possible, both with people and with the surrounding environment. Research on this is being conducted in what we call Borderland [12], which concerns the wearable computer and the tools that let it communicate with people and technology. A wearable computer with a network connection makes it possible to communicate with people at distant locations, independent of the user's current location. This is of course possible today with mobile phones etc., but a significant difference with the wearable computer is the possibility of a broader use of media and the unobtrusiveness of using a wearable computer. One of the goals for wearable computers is that the user can operate them without diminishing his presence in the real world [4]. This, together with the wearable computer as a tool for rich¹ communication, makes new ways of communication possible. A wearable computer user could become a beacon of several people’s knowledge and experience, a knowledgeable user. The wearable computer would not just be a tool for receiving expert help [8] but a tool to give other people the impression that the user has the knowledge himself. The research questions this brings forward include by what means communication can take place and what types of media are important for this type of communication.
There is also the question of how this way of communicating will affect the participants involved, and what advantages and disadvantages there are with this form of communication. In this paper we present experiences gained from using wearable computers as a tool to communicate knowledge and experience from both the user and other participants, over the network or locally.

Environment for Testing
The usage of wearable computers for communication was tested at different fairs that the Media Technology group attended. The wearable computer was part of the group's exhibition and was used to communicate with the immobile part of the exhibition. Communication was also established with remote persons from the group who were not attending the fairs. Both the immobile and remote participants could communicate with the wearable computer through video, audio and text. The fairs ranged from small fairs local to the university, aimed at attracting new students, to bigger fairs where research was presented to investors and other interested parties.

¹ By rich we mean that several different media are used, such as audio, video and text.

RELATED WORK
Collaborative work using wearable computers has been discussed in several publications [2, 3, 13]. The work has focused on how several wearable computers and/or computer users can collaborate. Not much work has been done on how the wearable computer user can be a mediator for the knowledge and experience of other people. Lyons and Starner's work on capturing the experience of the wearable computer user [10] is interesting, and some of the work there can be used for sharing knowledge and experience in real time. But it is also important to consider the other direction, where people are sharing with the wearable computer user. As pointed out in [5], wearable computers tend to be most often used in isolation.
We believe it is important to study how communication with other people can be enabled and enhanced by using this kind of platform. THE MOBILE USER We see the mobile user as one using a wearable computer that is seamlessly connected to the Internet throughout the day, regardless of where the user is currently situated. In Borderland we currently have two different platforms which both enable this; one is based on a laptop and the other is based on a PDA. In this section we discuss our current hardware and software solution used for the laptop-based prototype. This prototype is also the one used throughout the remainder of this paper, unless explicitly stated otherwise. Hardware Equipment The wearable computer prototype consists of a Dell Latitude C400 laptop with a Pentium III 1.2 GHz processor, 1 GB of main memory and built-in IEEE 802.11b. Connected to the laptop is a semi-transparent head-mounted display by TekGear called the M2 Personal Viewer, which provides the user with a monocular full color view of the regular laptop display in 800x600 resolution. Fit onto the head-mounted display is a Nogatech NV3000N web camera that is used to capture video of what the user is currently looking or aiming his head at. A small wired headset with an earplug and microphone provides audio capabilities. User input is received through a PS/2-based Twiddler2 providing a mouse and chording keyboard via a USB adapter. The laptop together with an USB-hub and a battery for the head-mounted display are placed in a backpack for convenience of carrying everything. A battery for the laptop lasts about 3 hours while the head-mounted display can run for about 6 hours before recharging is needed. What the equipment looks like when being worn by a user is shown in figure 1. Note that the hardware consists only of standard consumer components. 
While it would be possible to make the wearable computer less physically obtrusive by using more specialized custom-made hardware, that is not a goal in itself at this time. We do, however, try to reduce its size as new consumer components become available. There is work being done on a PDA based wearable that can be seen in figure 2. The goal is that it will be much more useful outside the Media Technology group at Luleå University of Technology and thereby make it possible to do some real life tests on the knowledgeable user.

Figure 1: The Borderland laptop-based wearable computer.

Software Solution
The commercial collaborative work application Marratech Pro² running under Windows XP provides the user with the ability to send and receive video, audio and text to and from other participants using either IP-multicast or unicast. In addition to this there is also a shared whiteboard and shared web browser. An example of what the user may see in his head-mounted display is shown in figure 3.

² http://www.marratech.com

Figure 3: The collaborative work application Marratech Pro as seen in the head-mounted display.

BEYOND COMMUNICATION
With a wearable computer, several novel uses emerge as a side effect of the communication ability that the platform allows. In this section we will focus on how knowledge and experiences can be conveyed between users and remote participants. Examples will be given on how this sharing of information can be applied in real world scenarios.

Becoming a Knowledgeable User
One of the key findings at the different fairs was how easily a single person could represent the entire research group, provided he was mobile and could communicate with them. When meeting someone, the wearable computer user could ask questions and provide answers that may in fact have originated from someone else at the division. As long as the remote information, e.g.
questions, answers, comments and advice, was presented to our user in a non-intrusive manner, it provided an excellent way to make the flow of information as smooth as possible. For example, if a person asked what a certain course or program was like at our university, the participants at the division would hear the question as it was asked and could respond with what they knew. The wearable computer user then just had to summarize those bits of information in order to provide a very informative and professional answer.

This ability can be further extended and generalized as in the following scenario. Imagine a person who is very charismatic, who is excellent at holding speeches and can present information to an audience in a convincing manner. However, lacking technical knowledge, such a person would not be very credible when it comes to explaining actual technical details that may be brought up. If such a person is equipped with a wearable computer, he will be able to receive information from an expert group of people and should thus be able to answer any question. In effect, that person will now know everything and be able to present it all in a credible manner, hopefully for the benefit of all people involved.

Further studies are needed to find out whether and how this scenario would work in real life: can an external person convey the entire knowledge of, for example, a research group, and can this be done without the opposite party noticing it? From a technical standpoint this transmission of knowledge is possible with Borderland today, but would an audience socially accept it or would they feel they are being deceived?

Another, perhaps more important, use for this way of conveying knowledge is in health-care. In rural areas it may be a long way from the hospital to patients’ homes, and resources in terms of time and money may be too scarce to let a medical doctor visit all the patients in person.
However, a nurse who is attending a patient in his home can use a wearable computer to keep in contact with the doctor who may be at a central location. The doctor can then help make diagnoses and advise the nurse on what to do. He can also ask questions and hear the patient answer in his own words, thereby eliminating risks of misinterpretation and misunderstanding. This allows the doctor to virtually visit more patients than would have been possible using conventional means, and it serves as an example of how the knowledge of a single person can be distributed and shared over a distance.

Involving External People in Meetings
When in an online meeting, it is sometimes desirable for an ordinary user to be able to jump into the discussion and say a few words. Maybe a friend of yours comes by your office while you are in a conversation with some other people, and you invite him to participate for some reason; maybe he knows a few of them and just wants to have a quick chat. While this is trivial to achieve when at a desktop (you just turn over your camera and hand a microphone to your friend), this is not so easily done with a wearable computer for practical reasons.

Even though this situation may not be common enough to deserve any real attention, we have noticed an interesting trait of mobile users participating in this kind of meeting. The more people you meet when you are mobile, the greater the chance that some remote participant will know someone among those people, and thus the desire for him to communicate with that person becomes more prevalent. For this reason, it has suddenly become much more important to be able to involve ordinary users, those you just happen to meet, in the meeting without any time to prepare the other person for it.

A common occurrence at the different fairs was that the wearable computer user met or saw a few persons whom some participant turned out to know and wanted to speak with. Lacking any way besides using the headset to hear what the remote participants said, the only way to convey information was for our user to act as a voice buffer, repeating the spoken words in the headset to the other person. Obviously, it would have been much easier to hand over the headset, but several people seemed intimidated by it. They would all try on the head-mounted display, but were very reluctant to speak in the headset.3

3 Another exhibitor of a voice-based application mentioned they had the same problem when requesting people to try it out; in general people seemed very uncomfortable speaking into unknown devices.

To alleviate this problem, we found it would likely be very useful to have a small speaker as part of the wearable computer through which the persons you meet could hear the participants. That way, the happenstance meeting can take place immediately and the wearable computer user need not even take part in any way; he just acts as a walking beacon through which people can communicate. Of course, a side effect of this novel way of communicating may well be that the user gets to know the other person as well and thus, in the end, builds a larger contact network of his own.

We believe that with a mobile participant, this kind of unplanned meeting will happen even more frequently. Imagine, for example, all the people you meet when walking down a street or entering a local store. Being able to involve such persons in a meeting the way it has been described here may be very socially beneficial in the long run.

When Wearable Computer Users Meet
Besides being able to involve external persons as discussed in the section before, there is also the special case of inviting other wearable computer users to participate in a meeting. This is something that can be done using the Session Initiation Protocol (SIP) [7].

Figure 2: The Borderland PDA-based wearable computer.

A scenario that exemplifies when meetings between several wearable computer users at different locations would be highly useful is in the area of fire-fighting.4 When a fire breaks out, the first team of firefighters arrives at the scene to assess the nature of the fire and proceed with further actions. Often a fire engineer with expert knowledge arrives at the scene some time after the initial team in order to assist them. Upon arrival he is briefed on the situation and can then provide advice on how to best extinguish the fire. The briefing itself is usually done in front of a shared whiteboard on the side of one of the fire-fighting vehicles.

4 This scenario is based on discussions with a person involved in fire fighting methods and procedures in Sweden.

Considering the amount of time the fire engineer spends while being transported to the scene, it would be highly beneficial if the briefing could start immediately instead of waiting until he arrives. By equipping the fire engineer and some of the firefighters with wearable computers, they would be able to start communicating as soon as the first team arrives. Not only does this allow the fire engineer to be briefed on the situation in advance, but he can also get a first person perspective of the scene and assess the whole situation better. Just as in Kraut’s work [9], the fire engineer as an expert can assist the less knowledgeable before reaching the destination. As the briefing is usually done with the help of a shared whiteboard (which also exists in the collaborative work application in Borderland), there would be no conceptual change to their work procedures other than the change from a physical
whiteboard to an electronic one. This is important to stress: the platform does not force people to change their existing work behavior, but rather allows the same work procedures to be applied in the virtual domain when that is beneficial. In this case the benefit lies in the briefing being done remotely, thereby saving valuable time. It may even be that the fire engineer no longer needs to travel physically to the scene, but can provide all guidance remotely and serve multiple scenes at once. In a catastrophe scenario, this ability for a single person to share his knowledge and convey it to people at remote locations may well help in saving lives.

EVALUATION
Our findings are based on experiences from the fairs and exhibitions we have attended so far, as well as from pilot studies done in different situations at our university. The communication that the platform enables allows a user to receive information from remote participants and convey this to local peers. As participants can get a highly realistic feeling of “being there” when experiencing the world from the wearable computer user’s perspective, the distance between those who possess knowledge and the user who needs it appears to shrink. Thus, not only is the gap of physical distance bridged by the platform, but so is the gap of context and situation.

While a similar feeling of presence might be achieved through the use of an ordinary video camera that a person is carrying around together with a microphone, there are a number of points that dramatically set the wearable computer user apart from such a setup.
• The user will eventually become more and more used to the wearable computer, thus making the task of capturing information and conveying this to other participants more of a subconscious task. This means that the user can still be an active contributing participant, and not just someone who goes around recording.
• As the head-mounted display aims in the same direction as the user’s head, a more realistic feeling of presence is conveyed, since subtle glances, deliberate stares, seeking looks and other kinds of unconscious behavior are transmitted as well. The camera movement and what is captured on video thus become more natural in this sense.
• The participants could interact with the user and tell him to do something or go somewhere. While this is possible even without a wearable computer, this interaction in combination with the feeling of presence that already existed gave a boost to it all. Not only did they experience the world as seen through the user’s eyes, but they were now able to remotely “control” that user.

The Importance of Text
Even though audio may be well suited for communicating with people, there are occasions where textual chat is preferable. The main advantage of text as we see it is that, unlike audio, the processing of the information can be postponed until later. This has three consequences, all of which are very beneficial for the user.
1. The user can choose when to process the information, unlike a voice that requires immediate attention. This also means processing can be done in a more arbitrary, non-sequential order compared to audio.
2. The user may be in a crowded place and/or talk to other people while the information is received. In such environments, it may be easier to have the information presented as text rather than in an audible form, as the former would interfere less with the user’s normal task.
3. The text remains accessible for a longer period of time, meaning the user does not need to memorize the information at the pace it is given. For things such as URLs, telephone numbers, mathematical formulas and the like, a textual representation is likely to be of more use than the same spoken information.
While there was no problem in using voice when talking with the other participants, on several occasions the need to get information as text rather than voice became apparent. Most of the time, the reason was that while in a live conversation with someone, the interruption and increased cognitive workload placed upon the user became too difficult to deal with. In our case, the user often turned off the audio while in a conversation so as not to be disturbed. The downside of this was that the rest of the participants in the meeting no longer had any way of interacting or providing useful information during the conversation.5

There may also be privacy concerns that apply; a user standing in a crowd or attending a formal meeting may need to communicate in private with someone. In such situations, sending textual messages may be the only choice. This means that the user of a wearable computer must not only be able to receive text, he must also be able to send it. We can even imagine a meeting with only wearable computer participants, which makes it clear that sending text will definitely remain an important need.

Hand-held chord keyboards such as the Twiddler have been shown to give good results for typing [11]. But these types of devices still take time to learn, and for those who seldom need to use them the motivation to learn typing efficiently may never come. Other alternatives that provide a regular keyboard setup, such as the Canesta Keyboard Perception Chipset that uses IR to track the user’s fingers on a projected keyboard, also exist and may well be a viable option to use. Virtual keyboards shown on the display may be another alternative and can be used with a touch-sensitive screen or eye-tracking software in the case of a head-mounted display.
Voice recognition systems translating voice to text may be of some use, although these will not work in situations where privacy or quietness is of concern. It would, of course, also be possible for the user to carry a regular keyboard with him, but that can hardly be classified as convenient enough to be truly wearable.

There is one final advantage of text compared to audio, and that is the lower bandwidth requirements of the former compared to the latter. On some occasions there may simply not be enough bandwidth, or the bandwidth may be too expensive, for communicating by other means than through text.

5 This was our first public test of the platform in an uncontrolled environment, so none of the participants was sure of what was the best thing to do in the hectic and more or less chaotic world that emerged. Still, much was learnt thanks to exactly that.

Camera and Video
Opinions about the placement of the camera on the user’s body varied among the participants. Most of them liked having the camera always pointing in the same direction as the user’s head, although there were reports of becoming disoriented when the user turned his head too frequently. Some participants wanted the camera to be more body stabilized, e.g. mounted on the shoulder, in order to avoid this kind of problem. While this placement would give a more stable image, it may reduce the feeling of presence as well as obscure the hints of what catches the user’s attention. In fact, some participants expressed a desire to be given an even more detailed view of what the user was looking at by tracking his eye movements, as that is something which cannot be conveyed merely by having the camera mounted on the user’s head. As Fussell points out [6], there are problems that have to be identified with head-mounted cameras. Some of these problems may be solved by changing the placement of the camera on the body.
However, further studies are needed to draw any real conclusions about the effects of the different choices when used in this kind of situation.

Some participants reported a feeling of motion sickness with a high framerate (about 5 Hz), and for that reason preferred a lower framerate (about 1 Hz) providing almost a slideshow of still images. However, those who had no tendency for motion sickness preferred as high a framerate as possible, because otherwise it became difficult to keep track of the direction when the user moved or looked around suddenly. In [1] it is stated that a high framerate (15 Hz) is desirable in immersive environments to avoid motion sickness. This suggests our notion of a high framerate was still too low, and increasing it further might have helped eliminate this kind of problem.

Microphone and Audio
Audio was deemed very important. Through the headset microphone the participants would hear much of the random noise from the remote location as well as discussions with persons the user met, thereby enhancing the feeling of “being there” tremendously. Of course, there are also situations in which participants are only interested in hearing the user when he speaks, thereby pointing out the need for good silence suppression to reduce any background noise.

Transmission of Knowledge
Conveying knowledge to a user at a remote location seems in our experience to be highly useful. So far, text and audio have most of the time been enough to provide a user with the information needed, but we have also experienced a few situations calling for visual aids such as images or video.

CONCLUSIONS
We have presented our prototype of a mobile platform in the form of a wearable computer that allows its user to communicate with others. We have discussed how remote participants can provide a single user with information in order to represent a larger group, and also how a single expert user can share the knowledge he possesses in order to assist multiple persons at a distance. The benefits of this sharing have been exemplified with scenarios taken from health-care and fire-fighting situations. The platform serves as a proof-of-concept that this form of communication is possible today.

Based on experiences from fairs and exhibitions, we have identified a number of areas that need further refinement in order to make this form of communication more convenient for everyone involved. The importance of text and the configuration and placement of video have been discussed.

The equipment used in these trials is not very specialized and can be bought and built by anyone. The big challenge in wearable computing today lies in its usage, and in this paper one usage of the wearable computer, as a tool for sharing knowledge and experience, was presented.

Future Work
We currently lack quantitative measures for our evaluation. For this, a wearable computer that ordinary people will accept to use in their everyday life is needed. It is believed that the PDA-based wearable that was mentioned earlier in this paper is that kind of wearable computer, and the plan is to do user tests for some of the scenarios that have been mentioned earlier in the paper. There are also plans to improve the prototype with more tools for improving sharing of experience and knowledge. One thing that is being worked on now is to incorporate a telepointer over the video so distant participants can share with the wearable computer user what they are talking about or what has their attention at the moment.

ACKNOWLEDGEMENTS
This work was sponsored by the Centre for Distance-spanning Technology (CDT) and Mäkitalo Research Centre (MRC) under the VINNOVA RadioSphere and VITAL project, and by the Centre for Distance-spanning Health care (CDH).

REFERENCES
1. Bierbaum, A., and Just, C. Software tools for virtual reality application development, 1998. Applied Virtual Reality, SIGGRAPH 98 Course Notes.
2. Billinghurst, M., Bowskill, J., Jessop, M., and Morphett, J. A wearable spatial conferencing space. In Proc. of the 2nd International Symposium on Wearable Computers (1998), pp. 76–83.
3. Billinghurst, M., Weghorst, S., and Furness, T. A. Wearable computers for three dimensional CSCW. In Proc. of the International Symposium on Wearable Computers (1997), pp. 39–46.
4. Brewster, S., Lumsden, J., Bell, M., Hall, M., and Tasker, S. Multimodal ’eyes-free’ interaction techniques for wearable devices. In Conference on Human Factors in Computing Systems (2003), pp. 473–480.
5. Fickas, S., Kortuem, G., Schneider, J., Segall, Z., and Suruda, J. When cyborgs meet: Building communities of cooperating wearable agents. In Proc. of the 3rd International Symposium on Wearable Computers (October 1999), pp. 124–132.
6. Fussell, S. R., Setlock, L. D., and Kraut, R. E. Effects of head-mounted and scene-oriented video system on remote collaboration on physical tasks. In CHI2003 (April 2003).
7. Handley, M., Schulzrinne, H., Schooler, E., and Rosenberg, J. SIP: session initiation protocol, March 1999. IETF RFC2543.
8. Kortuem, G., Bauer, M., Heiber, T., and Segall, Z. Netman: The design of a collaborative wearable computer system. ACM/Baltzer Journal on Mobile Networks and Applications (MONET) 4, 1 (1999).
9. Kraut, R. E., Miller, M. D., and Siegel, J. Collaboration in performance of physical tasks: Effects on outcomes and communication. In Computer Supported Cooperative Work (1996).
10. Lyons, K., and Starner, T. Mobile capture for wearable computer usability testing. In International Symposium on Wearable Computers (ISWC 2001) (October 2001), pp. 69–76.
11. Lyons, K., Starner, T., Plaisted, D., Fusia, J., Lyons, A., Drew, A., and Looney, E. Twiddler typing: One-handed chording text entry for mobile phones. Technical report, Georgia Institute of Technology, 2003.
12. Nilsson, M., Drugge, M., and Parnes, P. In the borderland between wearable computers and pervasive computing. Research report, Luleå University of Technology, 2003. ISSN 1402-1528.
13. Siegel, J., Kraut, R. E., John, B. E., and Carley, K. M. An empirical study of collaborative wearable computer systems. In Conference companion on Human factors in computing systems (1995), ACM Press, pp. 312–313.

Sharing Multimedia and Context Information Between Mobile Terminals
Jani Mäntyjärvi, Heikki Keränen, and Tapani Rantakokko
VTT Electronics, Technical Research Centre of Finland, P.O. Box 1100, FIN-90571 Oulu, Finland
{Heikki.Keranen, Tapani.Rantakokko, Jani.Mantyjarvi}@vtt.fi

ABSTRACT
Mobile terminal users have needs for sharing experiences and common interests in a context sensitive manner. However, due to the current division of creation, delivery and access functionality of multimedia into separate applications, much user effort is needed to communicate efficiently. In this paper an approach for a user interface for mobile terminals to share multimedia and context information is presented and discussed. A map-based interface and a domain object model-based user interface technique are utilized.

INTRODUCTION
Sharing of experiences using mobile technology is becoming more common since current mobile terminals enable capturing and delivery of multimedia content. However, due to the physical limitations of mobile terminals in presenting and processing multimedia, they require particular user interface (UI) solutions. Current user interfaces do not provide means to share multimedia content effectively in real time, since creating, delivering, and managing multimedia documents needs considerable effort.

Context awareness of mobile terminals enables novel dimensions for mobile communication. Mobile terminals can share and present contexts by showing contexts of their members as symbols in a phonebook [12].
The sharing of context information enables the extension of the basic applications of mobile terminals with context features, for example context-based call operation [13] and messaging [7]. Sharing of context information creates potential for more efficient multimedia distribution, augmentation, and content management. In this paper we present and discuss an approach for a user interface that supports the presentation and sharing of multimedia and context information together on a context-aware map. Furthermore, we discuss technologies that enable the user interface solution. A UI solution for an online community is presented in more detail in [15].

TECHNIQUES FOR CREATING AND SHARING MULTIMEDIA
Crossing application boundaries
Applications are an artificial concept of computer science, and for users there are often artificial boundaries between applications. In our case distinct applications exist for map-based positioning, taking photos or video shots, playing the media, sharing the media files created and showing the current context of each user. A great amount of user effort is required to cross those boundaries, as discussed in [10]. To deliver information to a community about what is happening and at which location, a user needs to copy location information from the positioning application and a media file from the camera application to a message to be sent in an instant messaging application. On the receiving side user effort is needed to figure out the user's position relative to the position of his friend, because the position of his friend is in the received message in the messaging application and his own position is in the positioning application. Further effort is needed to figure out what his friend is doing right now by looking at the context sensitive phonebook if the sender didn't bother to write it directly in the message.
As a solution to the problems caused by applications, Raskin stated that there should not be any separate applications, but objects and operations that can manipulate those objects [10]. One of the intriguing technologies in this direction is the Naked Objects framework [8], which maintains a one-to-one correspondence between the domain or business object model and the UI by enabling the generation of the UI automatically from the domain object model. Our object model for enabling multimedia communication in an online community consists of people in the community, the multimedia files they create and share, and a shared map acting as a container object for people and multimedia files.

Objects in the UI should look and function in a similar way regardless of the context where they are used and the size rendered [10]. By using icons it is possible to present objects in a very small space [3], which is important to fit more objects on a map being displayed on a small screen. By using the same icon in larger representations of the object, the user easily associates the object with the one presented by the small icon.

Maps
Geographical maps have unique advantages, being direct representations of the real world already familiar to users and exploiting human spatial memory. Positioning applications showing your place and route to a destination have been popular especially in car and boat navigation. The impression of connectivity to the real world can be enhanced by using positioning techniques to provide a real-time, up-to-date "you are here" position symbol on the map and using an electronic compass to keep the map parallel to the real world regardless of device orientation. The idea of capturing position and context during multimedia creation and using that information for laying out multimedia objects on geographical maps has been used successfully for multimedia retrieval [1]. It is easy to find images and video clips about a certain situation or place from a map.
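The object model described above (people, the media they create, and a shared map acting as a container) and the map-based retrieval of media can be illustrated with a minimal sketch. This is not the authors' implementation and does not use the Naked Objects framework; the class names `Person`, `MediaObject` and `SharedMap`, the method names, and the use of the haversine distance for "near a place" are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from math import asin, cos, radians, sin, sqrt


@dataclass
class Person:
    name: str
    lat: float
    lon: float
    activity: str = "Standing"  # hypothetical field holding a user-activity label


@dataclass
class MediaObject:
    author: str
    kind: str  # e.g. "photo" or "video"
    lat: float
    lon: float


def distance_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres (haversine formula)."""
    earth_radius_m = 6371000.0
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * earth_radius_m * asin(sqrt(a))


@dataclass
class SharedMap:
    """Container object for the people and media placed on the map."""
    people: list = field(default_factory=list)
    media: list = field(default_factory=list)

    def add_media(self, person, kind):
        # The new media object is tagged with its author and the author's position.
        item = MediaObject(person.name, kind, person.lat, person.lon)
        self.media.append(item)
        return item

    def media_near(self, lat, lon, radius_m):
        # Retrieval by place: all media created within radius_m of the given point.
        return [m for m in self.media
                if distance_m(lat, lon, m.lat, m.lon) <= radius_m]
```

With such a model, creating a clip requires no copying of position data between applications: the media object takes its position from its author when it is added to the map, and retrieval is a query on the map itself.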
What distinguishes maps from real-world and augmented reality solutions is that maps help users to see farther than physically possible and get an overview of the environment faster than physically possible. This feature has been utilised for many navigational purposes to find a route from place to place. Geographical maps are also useful for presenting and finding electronic services having an unambiguous geographical location [9], instead of some kind of context aware menu which is constantly changing when you are walking in the city.

Maps have the ability to visualise very heterogeneous objects, being either physical like people or immaterial like video clips. The only requirement is that objects must have location information. Putting heterogeneous objects from different sources on a geographical map can help the user to get a good overview of how things are related to each other, which may help in decision making. We utilise this feature in our user interface. In our case there are terminals which share their context information and the multimedia objects they have created on a map online. Minimal user effort is needed to communicate their position, their context and the context of the multimedia created. New media objects can be represented by a blinking icon, and when the map becomes crowded the oldest media objects can be removed from the map in a similar way that instant messaging applications remove the oldest messages.

CONTEXT INFORMATION
A mobile terminal may be aware of the context of its user [6,11]. Data provided by several onboard sources, e.g. various types of sensors, and remote sources, e.g. location services, can be processed into a context representation in which context abstractions describe concepts from the real world, for example loud, warm or at home. This facilitates the utilization of context information e.g. in various applications and in communicating context to other terminals [5,6]. Describing the context information using a commonly agreed ontology is one way to achieve this. The sharing of context information between several terminals can be realized using the latest communication standard protocols, e.g. GPRS, 3rd generation networks and Bluetooth.

Here we discuss the presentation of context information available in mobile terminals to support online communities. As discussed in the previous section, the UI solution for online communities should present many types of information, including various multimedia documents, context information and group interests and preferences, in an online manner, and at the same time keep the UI clear and easy to use.

Context information represents the current state of the object or its environment and can be presented as pictures. The classification of UI pictures for small interfaces is provided in [4]. Their explanation based on [2] indicates that picture classes for small UIs are Iconic, Index, and Symbolic pictures. Most UI pictures are Index pictures as they are associated with a function. In the work of Schmidt et al. [12] availability and location information is presented as pictures in the phonebook. The availability is presented as Symbolic color codes similar to traffic lights, while the location is presented as Index pictures of a house indicating ‘at home’, a factory indicating ‘at work’ and a car indicating ‘on the way’.

In our UI, context information describing a person's state is coded into the Iconic picture of that person as presented in Table 1. Animation can be used to reflect user activity like walking, running etc. People can express themselves by selecting the icon set representing them, which brings challenges and possibilities for graphic artists.

Table 1. Context information with classes used in user interface.
Context type      Classes
User activity     Standing, Walking, Running, Chatting
Environment       Silent, Loud, Dark, Bright, Cold, Warm, Hot
Device activity   Call, Browse, Chat, Idle

Context information related to the environment and device is more challenging, because these are not first class objects having icons in the UI. Therefore we present context information related to environment and device only on request as index and symbolic icons (Table 1) in the same way as done in [12].

PROTOTYPE
We have created a context-aware map-based interface for accessing situated services with mobile terminals [9]. The current prototype, which is built on the Compaq iPAQ 3660 PDA, includes positioning via WLAN and context based control via an external sensor box [14]. XML-based maps are rotated with the aid of a compass sensor, and zooming and scrolling can be performed by a user's gestures derived from proximity and accelerometer sensors, respectively. An ontology for describing sensor-based context information is used in sharing context data [16]. We are exploring the Naked Objects framework [8] as a user interface solution for interacting with objects, and we are extending the Naked Objects platform by implementing an object viewing mechanism (OVM) for PocketPC style devices, because the original framework contains an OVM only for desktop PCs.

Figure 2. A screenshot of online information sharing (context information window). Context data is shown with clear visualisations.

With the map-based interface, the author and current location are associated with the multimedia document (Fig. 1b), and the document is added to the map (presented as an icon, Fig. 1a). Fig. 1a presents a map-based view seen by User 1. His position is shown in the middle of the screen, in the center of sight. Two other users are also in the visible area, and part of their route is illustrated with broken lines. The pull-down menu shows available communication operations. The context view (Fig. 1c) shows the detailed context.
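To make the notion of a context abstraction concrete, the following minimal sketch maps raw sensor readings to environment labels of the kind listed in Table 1. It is only an illustration: the function name `abstract_context`, its parameters, the units and every threshold value are assumptions, and crisp thresholds are used here for simplicity, which is not necessarily how the prototype derives its context.

```python
def abstract_context(sound_db, light_lux, temperature_c):
    """Map raw readings to environment labels (all thresholds are assumed values)."""
    sound = "Loud" if sound_db >= 70 else "Silent"
    light = "Bright" if light_lux >= 200 else "Dark"
    if temperature_c < 10:
        temperature = "Cold"
    elif temperature_c < 25:
        temperature = "Warm"
    else:
        temperature = "Hot"
    return {"sound": sound, "light": light, "temperature": temperature}
```

A terminal could share the resulting labels rather than the raw readings, so that the receiving side only needs to pick the matching symbol for each label.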
The context represented by several symbols provides a partial description of the situation, yet the interpretation and understanding of the overall situation is the user’s task. Fig. 2 presents a screenshot of online information sharing in the context information window. Context data is shown with clear visualisations.

Figure 1. Screenshots of a UI. A user has created a video clip and placed it on a map. Arrows between screenshots describe navigation achieved by clicking at the starting point of the arrow.

The representation and sharing of multimedia and context information with this UI solution does not require any effort for switching between various types of applications. To get rid of the concept of applications, which creates artificial boundaries for users, a lot of research and development is needed to make current computing systems support the division of software into objects and operations.

SUMMARY AND DISCUSSION

A UI solution for mobile terminals presenting and sharing multimedia with context information is introduced. The approach utilizes object-oriented UI techniques and a shared geographical map to present multimedia objects and contexts of group members in the same view. The UI solution satisfies the following needs:
• Sharing interesting findings from the environment by using multimedia and effortless communication of the current group situation.
• Multimedia documents are presented on the map as icons to compress information representation and to provide easy access to the full content of objects.
• Online sharing of context information (activity, device and environment) with simple but descriptive symbols.

One concern with this approach is that the map becomes crowded due to active multimedia production and by bringing other objects and services to the map. This can be helped to some extent by map labeling algorithms, but some kind of map filtering method is needed in the long run.
Other issues requiring more concern (for technical implementation) include:
• Deciding how the messages and context information are delivered and stored in the network,
• Where the maps are loaded,
• Who creates maps and in which format.

Moreover, aspects that need further investigation comprise:
• How the access of users to shared information can be limited.
• How to handle terminals which do not have a map-based interface.
• How to provide support for representing more multiform context information.

In the future, we will continue the integration of the map interface to the Naked Objects platform. Moreover, user tests are required to obtain experiences in real usage situations and understanding of symbols used.

REFERENCES
1. Hewagamage, K.P., Hirakawa, M., "Augmented Album: situation-dependent system for a personal digital video/image collection", IEEE Intl. Conference on Multimedia and Expo, Vol. 1, pp. 232-236, 2000.
2. Hietala, V., Kuvien todellisuus, Gummerus, Helsinki, 1993 (in Finnish).
3. Horton, W., "Designing Icons and Visual Symbols", Proc. of the CHI '96, ACM Press, New York, USA, pp. 371-372, 1996.
4. Makarainen, M., Isomursu, P., Exploiting multimedia components in small user interfaces, IEEE Intl. Conference on Multimedia and Expo, Vol. 1, pp. 749-752, 2002.
5. Mäntyjärvi, J. et al., "Collaborative Context Determination to Support Mobile Terminal Applications", IEEE Wireless Communications, Vol. 9(5), New York, pp. 39-45, 2002.
6. Mäntyjärvi, J., Seppänen, T., "Adapting Applications According to Fuzzy Context Information", Interacting with Computers, Vol. 15(3), Elsevier, Amsterdam, To Appear, 2003.
7. Nakanishi, Y., et al., Context-aware Messaging Service: A Dynamical Messaging Delivery using Location information and Schedule information, Journal of Personal Technologies, Vol. 4, Springer Press, pp. 221-224, 2000.
8. Pawson, R., and Matthews, R., "Naked objects: a technique for designing more expressive systems," ACM SIGPLAN Notices, Vol. 36(12), New York, USA, pp. 61-67, 2001.
9. Rantakokko, T., and Plomp, J., "An Adaptive Map-Based Interface for Situated Services," Submitted to Smart Objects Conference, Grenoble, France, 2003.
10. Raskin, J., The Humane Interface: new directions for designing interactive systems, ACM Press, 2000.
11. Schmidt, A., et al., Advanced Interaction in Context, LNCS No. 1927, 2nd Intl. Symposium on Hand Held and Ubiquitous Computing, pp. 89-101, 1999.
12. Schmidt, A., et al., Context-Phonebook - Extending Mobile Phone Applications with Context, 3rd Intl. Workshop on HCI with Mobile Devices, Lille, France, 2001.
13. Schmidt, A., et al., Context-Aware Telephony over WAP, Personal Technologies, Vol. 4(4), pp. 225-229, 2000.
14. Tuulari, E., and Ylisaukko-oja, A., "SoapBox: A Platform for Ubiquitous Computing Research and Applications," Intl. Conference on Pervasive Computing, Zürich, Switzerland, pp. 125-138, 2002.
15. Keränen, H., Rantakokko, T., Mäntyjärvi, J., "Presenting and sharing multimedia within online communities using context aware mobile terminals", in IEEE International Conference on Multimedia and Expo, Vol. 2, pp. 641-644, 2003.
16. Korpipaa, P.; Mantyjarvi, J.; Kela, J.; Keranen, H.; Malm, E.J., Managing context information in mobile devices, IEEE Pervasive Computing, Vol. 2(3), pp. 42-51, 2003.

Using an Extended Episodic Memory Within a Mobile Companion

Alexander Kröner, Stephan Baldes, Anthony Jameson, and Mathias Bauer
DFKI, German Research Center for Artificial Intelligence
Stuhlsatzenhausweg 3, 66123 Saarbrücken, Germany
<first name>.<last name>@dfki.de

ABSTRACT

We discuss and illustrate design principles that have emerged in our ongoing work on a context-aware, user-adaptive mobile personal assistant in which an extended episodic memory, the personal journal, plays a central role.
The prototype system SPECTER keeps track of its user’s actions and affective states, and it collaborates with the user to create a personal journal and to learn a persistent user model. These sources of information in turn allow SPECTER to help the user with the planning and execution of actions, in particular in instrumented environments. Three principles appear to offer useful guidance in the design of this and similar systems: 1. an emphasis on user-controlled collaboration as opposed to autonomous system initiatives; 2. provision of diverse, multiple benefits to the user as a reward for the effort that the user must inevitably invest in collaboration with the system; and 3. support for diverse forms of collaboration that are well suited to different settings and situations. We illustrate the way in which these principles are guiding the design of SPECTER by discussing two aspects of the system that are currently being implemented and tested: (a) The provision of multiple, qualitatively different ways of interacting with the personal journal allows the user to contribute to its construction in various ways, depending on the user’s current situation, and also to derive multiple benefits from the stored information. (b) SPECTER’s collaborative methods for learning a user model give the user different ways in which to contribute essential knowledge to the learning process and to control the content of the learned model.

INTRODUCTION

There is growing agreement, reflected in the very existence of this workshop, that an extended episodic memory can constitute a valuable component of systems that serve as personal companions. But there remain numerous open questions about how such a memory can be acquired and exploited. How much work will the user have to do to ensure that the extended memory is sufficiently complete and accurate; what form should this work take; and how can the user be motivated to do it?
How can the system analyze the contents of the episodic memory so as to learn useful regularities that can in turn be exploited for the assistance of the user? In this contribution, we discuss and illustrate three of the principles that we have found useful in our ongoing work on the relevant prototype system SPECTER. After sketching SPECTER’s functionality and comparing it with that of some representative related systems, we formulate and briefly justify three principles which appear to constitute a useful approach to addressing these requirements. Then, two aspects of SPECTER are discussed which illustrate how these principles can serve as a guide to the many interrelated design decisions that need to be made with systems that feature extended episodic memories.

BRIEF OVERVIEW OF SPECTER

Basic Functionality

SPECTER is a mobile personal assistant that is being developed and tested for three mutually complementary scenarios, involving shopping, company visits, and interaction at trade fairs. SPECTER exhibits the following characteristic set of interrelated functions: It extends its user’s perception by acquiring information from objects in instrumented environments and by recording (to the extent that is feasible) information about the user’s actions and affective states. It builds up a personal journal that stores this information. It uses the personal journal as a basis for the learning of a user model, which represents more general assumptions about the user (e.g., the user’s preferred ways of performing particular tasks). SPECTER refers to the information in the personal journal and user model when helping the user (a) to create plans for future actions and (b) to adapt and execute these plans when the time comes.
Relationships to Previous Work

The idea of building up a personal journal figured prominently in the early system Forget-Me-Not ([4]), though at that time the technology for communicating with objects in instrumented environments and for sensing the user’s affective states was much less well developed than it is now. The much more recent project MyLifeBits ([3]) has similarly explored the possibility of maintaining an extensive record of a user’s experience, but here the emphasis is more on managing recordings of various sorts (e.g., videos) than on storing more abstract representations. The idea of having a personal assistant learn a persistent user model can be found to some extent in many systems, such as the early Calendar Apprentice ([6]); but these systems have not used multifaceted personal journals as a basis for the learning, and there has been little emphasis on involving the user in the learning process. The idea of providing proactive, context-dependent assistance is reflected in many context-aware systems, such as shopping assistants and tourist guides; but there is much less emphasis on basing such assistance on a rich user model or on active collaboration by the user. The idea of collaboration between system and user has been emphasized in several projects; in particular, the Collagen framework (see, e.g., [8], [9]) is being used explicitly within SPECTER.

DESIGN PRINCIPLES

In this section, we present and briefly justify three design principles which have proven useful in the design of SPECTER and which should be applicable to some extent to related systems.

User-System Collaboration as the Basic Interaction Model

In the foreseeable future, no computing device will be able to perform the functions listed above without receiving a significant amount of help from the user at least some of the time. For example, a system cannot in general record all actions of the user that do not involve an electronic device; and the user’s affective reactions and evaluations are even more likely to be unrecognizable. Therefore, the user will have to provide some explicit input if a useful personal journal is to be built up. More generally, it is realistic to see each of the general functions served by the system as involving collaboration between system and user, although the exact division of labor can vary greatly from one case to the next. A different justification for an emphasis on collaboration is the assumption that, even in cases where help by the user is not required, users will often want to be involved in the system’s processing to some extent, so as to be able to exert some control over it.

Provision of Multiple Functions

The necessary collaboration effort by users implied by the previous principle will not in general be invested by users unless they see the effort as (indirectly) leading to benefits that clearly justify the investment. Designers of a system that requires such collaboration should therefore try to ensure that the system provides multiple benefits as a reward for the user’s investment. Even if only one or two particular types of benefit constituted the original motivation for the system’s design, it may be possible and worthwhile to look for additional functions that take advantage of the same user input.

Flexible Scheduling and Realization of Collaboration

In addition to the user’s motivation, another obstacle to obtaining adequate collaboration from the user is created by situational restrictions. For example, when the user is performing attention-demanding activities and/or interacting with SPECTER via a limited-bandwidth device, she may be able to provide little or no input.
A strategy for overcoming this problem is to look for ways of shifting the work required by the user to a setting where the user will have more attentional and computational resources available. For example, if the system can help the user to plan a shopping trip in advance, she may be able to use a high-bandwidth device and to supply information that she would not have time to supply while actually doing the shopping. Similarly, if SPECTER makes it worthwhile for the user to look back reflectively at the shopping trip after completing it, the user may be able to fill in some of the gaps in the record that the system has built up about the trip.

INTERACTION WITH THE PERSONAL JOURNAL

In accordance with the principle of multi-functionality, the data collection represented by the personal journal should be exploited in various ways. By providing methods for information retrieval, the journal may serve as an extension of the user’s personal memory for individual events and objects. A quite different type of application may use these data to provide feedback on how the user is spending her time, suggesting how she could adjust her time allocation in order to achieve her goals more effectively. Yet another way of exploiting the personal journal, discussed in the final major section below, is for SPECTER to mine its data in order to learn regularities that can serve as a basis for assistive actions.

These high-level interactions are based on so-called journal entries, which are created based upon signal input retrieved from an instrumented environment, or by means of abstraction. In the former case, fine-grained symbolic data are taken as input from sensors, and are directly stored in the journal. An exemplary setup of such an environment is shown in Figure 1, where SPECTER has been connected with the RFID infrastructure created in the project REAL ([10]). The recorded signals serve as input for abstraction methods, which may range from syntactical, hard-coded translation to machine learning techniques.

Figure 1: An environment for testing SPECTER with RFID input: an instrumented shelf with RFID-enriched products. On the right-hand side, a laptop which performs shop communication, and provides SPECTER with input. End-user display and interaction are performed via a PDA.

Journal entries are usually created automatically, which leads to several requirements for SPECTER’s user interface. First, the kind of data recorded, and potentially the way in which they have been retrieved, should be transparent in order to strengthen the user’s trust in the system. Hence the user needs a facility for inspecting the journal. Furthermore, content incorporated through entries may contain errors: measurement errors and wrong abstractions may occur, and the user herself might over time change her opinion about the correctness of previously created entries. Accordingly, SPECTER requires a user interface that enables the modification of journal content. Finally, with respect to the goal of flexible scheduling of collaboration, these requirements are completed by the need for an interface that is adaptable to varying application scenarios.

Figure 2: The SPECTER browser, with a viewer for listing journal entries. In the upper area controls for navigation, in the lower area the viewer display.

Interface Approach

That need for flexibility is taken into account by a journal browser, which enables accessing the personal journal via so-called viewers. These realize varying data views of the information stored in the journal, a popular approach known from systems such as [3], [7], and [11]. An example of SPECTER’s journal browser and a viewer for displaying lists of journal entries is shown in Figure 2. In this framework, the browser as well as the viewers may be exchanged with respect to the given platform and interaction task.
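The creation of journal entries described in this section, first from raw signals and then by abstraction, can be sketched as follows. This is a minimal illustration of the simplest case only, a hard-coded translation; the event names, the `JournalEntry` fields and the translation table are hypothetical and do not reflect SPECTER's actual data model.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class JournalEntry:
    category: str      # "signal" or an abstracted category
    description: str
    source: str        # "sensor" or "abstraction"
    # Entries this one was derived from; kept so that the user can
    # inspect how an abstracted entry came about.
    derived_from: List["JournalEntry"] = field(default_factory=list)


# Hypothetical translation table: raw RFID event -> abstract category.
TRANSLATION: Dict[str, str] = {
    "tag_left_shelf": "product_picked_up",
    "tag_back_on_shelf": "product_returned",
}


def record_signal(journal: List[JournalEntry], event: str, product: str) -> JournalEntry:
    """Store a raw sensor event directly in the journal."""
    entry = JournalEntry("signal", f"{event}:{product}", "sensor")
    journal.append(entry)
    return entry


def abstract(journal: List[JournalEntry], signal_entry: JournalEntry) -> Optional[JournalEntry]:
    """Translate a raw entry into an abstract one, if a rule exists."""
    event, product = signal_entry.description.split(":", 1)
    category = TRANSLATION.get(event)
    if category is None:
        return None  # no rule for this event; only the raw entry remains
    entry = JournalEntry(category, product, "abstraction", [signal_entry])
    journal.append(entry)
    return entry
```

Keeping a `derived_from` link is one possible way to meet the transparency requirement stated above; a machine-learning abstraction method could replace the lookup table without changing the entry structure.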
The browser is a central component that serves content requests from viewers and provides a repository of resources shared by several viewers. These resources include shared data, such as display preferences, and shared user interface elements, such as access to common navigation facilities and the viewer selection. That selection is in general performed automatically by the browser with respect to the display request, but may also be performed manually by the user if the viewer has registered itself within the browser’s user interface. Due to their varying functions, viewers may differ in their interaction not only with the user but also with the system itself. For instance, when displaying a list of journal entries, a viewer may be updated automatically when new entries (e.g., from sensors) arrive. That behavior might be confusing if the user is just entering data using a form-like viewer. Therefore the browser relies on a feature mechanism to configure itself with respect to the viewer’s preferences: a viewer’s configuration includes a list of feature triggers, which may be applied by SPECTER components such as the browser in order to adapt their behavior to the given viewer. Following our previous example, the form editor may in this way indicate that display updates are not permitted while the viewer is active.

Navigation and Annotation

Navigation in the personal journal relies in the first place on requests, which provide functionality similar to hyperlinks. Instead of Web addresses, they make reference to particular SPECTER components, optionally further described using form parameters. Additionally, they may carry a complex value encoded in XML that is submitted to the requested component. The user may apply these requests to browse the journal much like the Web, and may organize frequently used requests in a list of bookmarks. An alternative way of navigating the journal is provided by the so-called reminder points.
This specific kind of journal entry is created by the user during interaction with the environment with only one click (see the “!” button in the upper right corner of Figure 2). The rationale of these points is that the user might be too busy or distracted to provide detailed feedback. Nevertheless she might notice the need to adjust the system’s behavior, and this need can be expressed via a reminder point. Later on, at a more appropriate time and location for introspection, she may inspect the recorded reminder points and perform the required adjustments in collaboration with SPECTER.

Another way of dealing with journal entries is annotation. It provides a means of associating information with entries quite similar to the approach applied in [3]. In SPECTER, annotations serve in the first place as storage for information about how an entry performs with respect to selected aspects of the user model. Accordingly, annotations include free text, references to other journal entries or Web pages, content categories, and ratings. Here content categories represent predefined content descriptions, provided by SPECTER for quick (and less precise) description of the kind of content. A rating expresses the performance of an entry with respect to a rating dimension selected from a predefined set (e.g., importance or evaluation). Annotations are further described by a fixed set of metadata. These capture information about the annotation such as a privacy level and the source of the annotation. The latter is of particular importance, since the user has to stay informed about who has created an annotation: she herself, or SPECTER. An editor for entry annotations is shown in Figure 3. The form-like viewer provides feedback about the annotations associated with an entry, and enables editing them in part.

Figure 3: A form-like viewer that enables annotating a journal entry: a field for entering a free text comment, check boxes for content category selection, and select boxes for performing ratings. The selected values are marked with their sources (see “by”).

COLLABORATIVE LEARNING OF THE USER MODEL

As was mentioned above, one of the benefits offered by the personal journal is the ability of the system to learn a user model that can in turn serve as a knowledge source for intelligent assistance. Just as the acquisition of data for the personal journal is best viewed as a collaborative process, the same is true of the process of learning the user model. The system brings to this process (a) a large amount of data from the personal journal, (b) a repertoire of learning techniques, and (c) a large amount of computing capacity (and unlimited “patience”) that can be applied to the learning task. But the user’s input will in general also be necessary: Her common-sense knowledge and introspective abilities can help to filter out spurious additions to the user model that would simply reflect chance regularities in her behavior (cf. [1]). Moreover, when there are several possible learned models that are equally well supported by the data, the user may reasonably prefer the model that makes the most intuitive sense to her. In short, the basic conception of SPECTER gives rise to a novel interface design challenge: How can a user who has no technical knowledge of machine learning or user modeling be allowed to collaborate in the process of learning a user model from data?

We will look at this problem in connection with one particular function of SPECTER’s user model: that of triggering the offering of services to the user. SPECTER offers several types of service to the user. Some of these make use of external resources (e.g., technical devices such as printers), while others make use of internal functions of the system (e.g., retrieval of facts from the personal journal).
If SPECTER simply waited for the user to request each possible service explicitly, many opportunities would be lost, simply because the user is not in general aware of all currently available and relevant services. Therefore, SPECTER tries to learn about regularities in the user’s behavior that will allow it to offer services at appropriate times. For example, if the learned user model indicates that the user is likely to want to perform a particular action within the next few minutes, SPECTER may offer to activate a service that will facilitate that action.

While there exist a number of approaches for collaborative learning that involve a human in the process of constructing a classification model (e.g., a decision tree), these approaches focus on supporting data analysts as opposed to essentially naive users (see, e.g., [2]). We are currently investigating the use of machine learning tools like TAR2 ([5]) that apply heuristics to produce imperfect, but easily understandable (and thus modifiable) classification rules. Here we present an assistant component, the trigger editor, which gives the user intelligent suggestions for creating and modifying trigger rules for services.

Example: The EC Card Purchase Service

Our discussion will refer to the following example. Suppose that the user sometimes pays in stores with an EC card (see footnote 1). On one occasion the cashier rejects her card, telling her that her bank account provides insufficient funds to pay for her shopping. In order to prevent this embarrassing experience in the future, the user sets a reminder point, thus marking the current situation, and the resulting entry in the personal journal, to be dealt with later. Her rationale is that she has decided to create an automated service that triggers a status check of her bank account (a basic functionality provided by SPECTER) whenever an EC card payment is likely to occur.
In order to do so she will create an abstract model of this particular type of situation using SPECTER’s machine-learning capabilities. Whenever a new shopping situation occurs, SPECTER will use this model to classify the situation and trigger the bank account check in case this classification indicates a high probability of the user using her EC card.

Identifying Training Examples

In a first step, SPECTER’s machine-learning component needs a number of training examples: previous shopping episodes stored in the personal journal that can be used to distinguish EC payments from “non-EC payments”. The system displays the entry marked with the reminder point and asks the user to indicate what is special about it. The user indicates the use of the EC card (“MeansOfPayment = EC card” in the personal journal), whereupon SPECTER looks for previous entries of the same category (shopping) with identical and differing values for MeansOfPayment and classifies these examples according to this value (“positive” for EC card, “negative” for all other values).

Learning a Decision Tree

Once the training data have been identified, SPECTER applies a machine-learning algorithm to create an appropriate classifier. One useful learning technique in this context is decision tree learning, which yields a relatively comprehensible type of model (see Figure 4). Even though users would seldom be willing or able to define a reasonably accurate decision tree entirely by hand, critiquing a decision tree proposed by the system may be a reasonably easy, and perhaps even enlightening, activity if the user interface is designed well. Figure 4 shows two decision trees that the user might deal with in connection with the EC Card Purchase service. Each node of a tree is labeled with an attribute, and each edge specifies a possible value (or range of values) for the attribute.
Each leaf of the tree is labeled as positive or negative, indicating the decision that results if a path is traversed through the tree that leads to this leaf. In the case of service triggering, a positive result means that SPECTER should establish the goal of invoking the service. (Whether or not the service is actually invoked can depend on other factors, such as the existence of competing goals.) (See footnote 2.)

Figure 4: Two examples of decision trees that arise during the collaborative specification of a rule for triggering the service EC card purchase. (a) Initially generated decision tree. (b) Tree generated after the attribute "Store" has been specified by the user to be irrelevant.

Footnote 1: For non-European readers: An EC card is like a credit card except that the funds are transferred to the recipient directly from the purchaser’s bank account.

When the system presents a learned tree such as the one in Figure 4 (a), the user can critique it in any of several ways, including: eliminating irrelevant attributes, selecting paths from the tree, and modifying split decisions. The question of what interface designs are best suited for this type of critiquing requires further exploration and user testing; the next subsection describes the critiquing interface currently being tested in SPECTER.

Critiquing of Decision Trees

Figure 5 shows the two main dialog boxes of the current decision tree editor, which is implemented as a viewer that runs within the browser. The interface allows the user to critique the current decision tree for a given type of decision step by step until she is satisfied with the result. The standard interface hides the potential complexity of a decision tree as depicted in Figure 4 by merely listing the set of attributes used by the machine-learning component (see Figure 5 (a)).
Depending on regularities occurring in the training data, some of these attributes, although well suited to discriminate positive and negative examples in the decision tree, might make little sense from the user’s perspective. For example, if the user happened to use her EC card only in the morning in all shopping episodes recorded by SPECTER, then the attribute TimeOfDay will almost inevitably be used in the decision tree. The user’s background knowledge, however, enables her to easily identify such “meaningless” aspects and remove this attribute altogether, thus preventing its use for classification purposes.

Footnote 2: Attributes other than the ones shown in this example may also be relevant, e.g., the total cost of the other items that are still on the user’s shopping list and which therefore remain to be purchased on the same day.

Another way of critiquing the decision tree is to replace an attribute by another one which is semantically related. To this end, whenever the user presses the “Add related attribute” button, SPECTER will identify concepts in the domain ontology that are in close proximity to the one represented by the attribute under consideration and generate appropriate attributes to be used in the decision tree. This way, the system’s capability to deal with regularities and statistical relationships among the training data is complemented by the user’s ability to deal with semantic interrelations.

Advanced users have even more options to influence the machine-learning component of SPECTER. They can directly inspect the classification model (depicted as a decision tree, as a set of rules (see Figure 5 (b)), or visualized in some other way yet to be investigated) and change the split criteria, i.e. the attribute values tested in a rule or tree (e.g. change Price from 117.50 to 100 as depicted in Figure 5). Doing so will of course affect the classification accuracy, i.e. the percentage of correct classifications of episodes from the personal journal.
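The critiquing loop described above (label the episodes, learn a model, let the user exclude an attribute, relearn, check accuracy) can be made concrete with a deliberately tiny sketch. It learns a single test rather than a full decision tree, and it is not SPECTER's learner or TAR2: the episode data, the function names and the one-test model are all assumptions chosen to keep the example short.

```python
from typing import Any, Dict, FrozenSet, List, Tuple

Episode = Dict[str, Any]
Example = Tuple[Episode, bool]
Rule = Tuple[str, str, Any]  # (attribute, operator, value); operator is "==" or ">="

# Hypothetical shopping episodes from a personal journal.
EPISODES: List[Episode] = [
    {"MeansOfPayment": "EC card", "TimeOfDay": "morning", "Price": 120.0, "Store": "A"},
    {"MeansOfPayment": "EC card", "TimeOfDay": "morning", "Price": 150.0, "Store": "B"},
    {"MeansOfPayment": "EC card", "TimeOfDay": "morning", "Price": 40.0, "Store": "A"},
    {"MeansOfPayment": "cash", "TimeOfDay": "afternoon", "Price": 15.0, "Store": "B"},
    {"MeansOfPayment": "cash", "TimeOfDay": "evening", "Price": 30.0, "Store": "A"},
    {"MeansOfPayment": "cash", "TimeOfDay": "afternoon", "Price": 60.0, "Store": "B"},
]


def label(episodes: List[Episode], target: str = "MeansOfPayment",
          positive: str = "EC card") -> List[Example]:
    """Mark each episode as positive (True) or negative (False)."""
    return [(e, e[target] == positive) for e in episodes]


def holds(rule: Rule, episode: Episode) -> bool:
    attribute, op, value = rule
    if op == ">=":
        return episode[attribute] >= value
    return episode[attribute] == value


def accuracy(rule: Rule, examples: List[Example]) -> float:
    """Fraction of examples for which the rule predicts the label."""
    correct = sum(1 for e, y in examples if holds(rule, e) == y)
    return correct / len(examples)


def learn_rule(examples: List[Example], excluded: FrozenSet[str] = frozenset(),
               target: str = "MeansOfPayment") -> Tuple[Rule, float]:
    """Return the single test with the highest training accuracy.

    Attributes in `excluded` (removed by the user) are never tested.
    """
    attributes = sorted(set(examples[0][0]) - {target} - set(excluded))
    best, best_acc = None, -1.0
    for attribute in attributes:
        values = sorted({e[attribute] for e, _ in examples})
        op = ">=" if isinstance(values[0], (int, float)) else "=="
        for value in values:
            rule = (attribute, op, value)
            acc = accuracy(rule, examples)
            if acc > best_acc:  # strict: the first best candidate wins ties
                best, best_acc = rule, acc
    return best, best_acc


examples = label(EPISODES)
first_rule, first_acc = learn_rule(examples)
# The user judges TimeOfDay to be a chance regularity and removes it.
second_rule, second_acc = learn_rule(examples, excluded=frozenset({"TimeOfDay"}))
# An advanced user overrides the learned threshold by hand.
manual_acc = accuracy(("Price", ">=", 100.0), examples)
```

With this data the learner first picks the TimeOfDay test because it happens to separate the episodes perfectly; once the attribute is excluded it falls back to a Price threshold with lower training accuracy. That trade of accuracy for a model the user finds meaningful is the point of the critiquing step.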
The user is informed about the current quality of the classification model and can bias the system to produce “false positives” rather than “false negatives” (which would mean that the bank account is checked even in some situations when the user will not use her EC card) or vice versa, depending on which error is more serious for the user.

One of the many design issues that we are exploring in connection with this interface is the question of whether (a) to present to the user a graphical depiction of the decision tree (as shown in Figure 4), (b) to stick to dialog boxes such as those in Figure 5, or (c) to offer a selection of interfaces. The principle of flexible scheduling and realization of collaboration suggests providing several different (though fundamentally consistent) views of a decision tree that are appropriate for different usage situations (e.g., a quick check on a rather unimportant decision tree vs. in-depth analysis and editing of a highly important one). The two different dialog boxes shown in Figure 5 represent a step in this direction.

Figure 5: The basic (left) and advanced (right) dialog boxes in the current version of the SPECTER decision tree editor. (a) Simple critiquing options. (b) Advanced critiquing options.

CONCLUSION AND FUTURE WORK

In this contribution we described in part our ongoing work on SPECTER, a system that aims at assisting users in instrumented environments by means of an episodic memory. First results include a set of design principles, which have already proven their value as guides through a potentially immense design space. We believe they may be found helpful by designers of other systems featuring advanced personal memories. Recently, a commercial product (Nokia LifeBlog; see footnote 3) was released that allows the user to create a simple version of what we call a personal journal. This provides a hint that this kind of functionality may sooner or later enter our daily lives.
Footnote 3: www.Nokia.com/lifeblog

We have illustrated how these design principles have been applied during the development of one of SPECTER’s most important components: the personal journal. We then concentrated on the interface approach, which consists of a journal browser and varying viewers. The browser allows navigating the journal contents using a hyperlink-like approach, and the viewers provide varying views of the data. An application of this interface is the decision tree editor. By means of our prototype implementation, we illustrated how the end user may construct triggers for services provided by SPECTER. This construction is basically an iterative process in which SPECTER creates candidate decision trees that might serve as triggers and the user critiques these trees. The editor supports this process in various ways, including attribute selection, biasing the learning component with the aim of minimizing the consequences of classification errors, and manual modification of the generated models.

Our next steps will include the extension of the view-based interface. For instance, we have to find out which kinds of viewers end users actually require, and we have to evaluate the implemented viewers. Additionally, the machine-learning component will need an interface that makes its complicated inferences and the resulting models accessible even to naive users.

ACKNOWLEDGMENTS

This research was supported by the German Ministry of Education and Research (BMB+F) under grant 52440001-01 IW C03 (project SPECTER). Furthermore, we would like to thank the REAL team for their valuable advice and support.

REFERENCES

1. Gediminas Adomavicius and Alexander Tuzhilin. User profiling in personalization applications through rule discovery and validation. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’99), pages 377–381, San Diego, CA, 1999.
2. M. Ankerst, C. Elsen, M. Ester, and H.-P. Kriegel.
Visual classification: An interactive approach to decision tree construction. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’99), pages 392–396, 1999.
3. J. Gemmell, G. Bell, R. Lueder, S. Drucker, and C. Wong. MyLifeBits: Fulfilling the Memex vision. ACM Multimedia, pages 235–238, 2002.
4. Mik Lamming and Mike Flynn. “Forget-me-not”: Intimate computing in support of human memory. In Proceedings of FRIEND21, the 1994 International Symposium on Next Generation Human Interface, Meguro Gajoen, Japan, 1994.
5. T. Menzies, E. Chiang, M. Feather, Y. Hu, and J. D. Kiper. Condensing uncertainty via incremental treatment learning. Annals of Software Engineering, Special issue on Computational Intelligence, 2002.
6. Tom Mitchell, Rich Caruana, Dayne Freitag, John McDermott, and David Zabowski. Experience with a learning personal assistant. Communications of the ACM, 37(7):81–91, 1994.
7. D. Quan, D. Huynh, and D. R. Karger. Haystack: A platform for authoring end user semantic web applications. In Proceedings of the 2nd International Semantic Web Conference (ISWC2003), pages 738–753, Sanibel Island, Florida, USA, 2003.
8. Charles Rich and Candace L. Sidner. COLLAGEN: A collaboration manager for software interface agents. User Modeling and User-Adapted Interaction, 8:315–350, 1998.
9. Charles Rich, Candace L. Sidner, and Neal Lesh. COLLAGEN: Applying collaborative discourse theory to human-computer interaction. AI Magazine, 22(4):15–25, 2001.
10. M. Schneider. Towards a transparent proactive user interface for a shopping assistant. In A. Butz, C. Kray, A. Krüger, and A. Schmidt, editors, Proceedings of the Workshop on Multi-User and Ubiquitous User Interfaces (MU3I 2004), SFB 378, Memo Nr. 83, 2004.
11. C. Shen, B. Moghaddam, N. Lesh, and P. Beardsley. Personal Digital Historian: User interface design. In Extended Abstracts of the 2001 Conference on Human Factors in Computing Systems, 2001.
u-Photo: A Design and Implementation of a Snapshot Based Method for Capturing Contextual Information

Takeshi Iwamoto
Graduate School of Media and Governance, Keio University
5322 Endo Fujisawa Kanagawa
iwaiwa@ht.sfc.keio.ac.jp

Shun Aoki
Graduate School of Media and Governance, Keio University
5322 Endo Fujisawa Kanagawa
shunaoki@ht.sfc.keio.ac.jp

Kazunori Takashio
Graduate School of Media and Governance, Keio University
5322 Endo Fujisawa Kanagawa
kaz@mkg.sfc.keio.ac.jp

Genta Suzuki
Graduate School of Media and Governance, Keio University
5322 Endo Fujisawa Kanagawa
genta@ht.sfc.keio.ac.jp

Naohiko Kohtake
Graduate School of Media and Governance, Keio University
5322 Endo Fujisawa Kanagawa
nao@ht.sfc.keio.ac.jp

Hideyuki Tokuda
Graduate School of Media and Governance, Keio University
5322 Endo Fujisawa Kanagawa
hxt@ht.sfc.keio.ac.jp

ABSTRACT

In this paper, we propose u-Photo, a method that uses the “action of taking a photograph” as a metaphor for capturing contextual information. A u-Photo is a digital photo image that can store not only visible pictures but also invisible information collected from sensors and devices embedded in a ubiquitous computing environment. Using u-Photo, a user can intuitively capture and view contextual information. Moreover, we present several applications that u-Photo makes possible: remote control of devices, remote monitoring of the environment, and suspend/resume of a user’s task.

Keywords

Pervasive Computing Architecture, Contextual Information, Sensors, Smart Appliances

INTRODUCTION

Today, ubiquitous computing is becoming a popular research field in which various issues are being addressed. In a ubiquitous computing environment, many sensors and devices are spread around a user and can be embedded in environments such as a living room or an office. These invisible sensors and devices can obtain much information about users or the environment and can provide it as contextual information to applications or middleware.

In this paper, we propose a suitable method for capturing contextual information, named u-Photo. A u-Photo is a digital photo image which contains contextual information about an environment. In other words, a u-Photo can store invisible information obtained from embedded devices or sensors along with an ordinary photo image. Furthermore, objects in the picture are used as keys for controlling the objects and obtaining information about their surrounding environment. When taking a u-Photo, the actions of “viewing the finder” and “releasing the shutter” are identical to those of an ordinary digital camera. The action of “viewing the finder” determines the target area for capturing contextual information, and “releasing the shutter” determines the timing. Through these actions, namely “taking a photograph”, contextual information can be stored in a digital photo image as a u-Photo.

In order to provide an intuitive method of viewing contextual information, we present “u-Photo Viewer”, a viewer application for u-Photo. u-Photo Viewer provides easy access to the contextual information stored in a u-Photo. Users are provided with a GUI for viewing contextual information and controlling the devices captured in the u-Photo. u-Photo Viewer places the GUI for controlling devices over the objects in the picture.

In this paper, we present the design and implementation of our system to realize u-Photo. The remainder of this paper is structured as follows. Section 2 presents scenarios of using u-Photo. Design issues of our research are described in Section 3, and the design and implementation are described in Sections 4 and 5. Related work is summarized in Section 6.
SCENARIO

To clarify our research goal, we now present several scenarios using u-Photo.

Scenario 1: Controlling Remote Devices using u-Photo

Bob takes pictures of his room, which are stored as u-Photos in his PDA. He goes out to work, forgetting to turn off the room light. After finishing work, he realizes he might have left the room light on. To check whether the light is on, he uses the u-Photo Viewer on his PDA and taps the “light icon” displayed on top of the image of the light in the u-Photo (shown in Figure 1). His u-Photo Viewer responds and shows that the room light is on. He then taps the “OFF” button displayed in the u-Photo to turn off the room light.

Figure 1: (a) shows “taking a u-Photo”; (b) shows the GUI for controlling the light

Scenario 2: Capturing/Showing Environmental Information

After turning off the light, Bob decides to go home. Wanting the room to be comfortable when he gets home, he views the environmental information of his room, such as temperature and brightness (shown in Figure 2). Clicking the icon of each appliance in the u-Photo Viewer displays the working condition of that appliance. He controls the air conditioner, as he did the room light, to make the room temperature more comfortable before reaching home.

Figure 2: GUI and Environmental Information

Scenario 3: Suspend/Resume User’s Task

On another day, Bob is watching a video at home, but must go out for an appointment. He takes a u-Photo of the TV screen before suspending the show. After his appointment, he goes to a cafe to relax. There, he looks up a “Public Display Service”. To use the service, he opens the u-Photo that was taken while he was watching the video at home. By operating the GUI in the u-Photo Viewer, he can easily migrate the state of the show to the public display. As a result, he can watch the rest of the show on the public display (shown in Figure 3).

DESIGN ISSUES

In this section, we consider design issues for u-Photo.
First, we address the issue concerning the snapshot based method adopted by u-Photo. Because previous research offers several methods for dealing with contextual information, we present the reason we adopted this method for building our system. Second, we address the issue of the domain of information that should be treated as contextual information.

Figure 3: (a) “taking a u-Photo” to save the task state; (b) resuming the task using a public TV

Snapshot Based Method

In the previous scenarios, the snapshot based method gave users several benefits, such as referring to contextual information about their room, controlling devices remotely, and resuming suspended tasks. When designing a system that deals with contextual information, how the system decides the target area and the timing for capturing contextual information tends to be the most important issue. Next we discuss these two issues, target and timing.

In general, there are two approaches to target determination. In the first approach, the system decides the target automatically, without user intervention. In the second approach, the user decides on the target area. Evidently, the first approach is easier for the user, since no interaction or operation is necessary between the user and the system. However, a system taking this approach needs to decide on the appropriate range for capturing contextual information in light of the user’s requirements. This is difficult, because the user’s requirements are prone to change depending on the situation. The second approach allows the user to specify the target area, so the system never forces an undesired target area on the user. However, this approach involves complicated operations if the system cannot provide an appropriate method for specifying a target area.
To solve this problem, we use the “snapshot” as a metaphor for “capturing contextual information”: a user can specify a target area through an intuitive operation similar to taking an ordinary photograph. Although this method requires more effort from the user than systems that adopt an automatic capturing mechanism, our system has the advantage of providing a method of capturing contextual information intuitively. To be more specific, by taking a picture using a digital camera, a user can take a “u-Photo” which contains various contextual information about the area within the range of the finder. Ordinarily, when taking a photograph, a user focuses on objects that are in sight. Similarly, when taking a u-Photo, a user captures an area of contextual information based on objects, namely devices or sensors. We call the objects that serve as landmarks in the user’s sight “key objects”. In the u-Photo system, stored contextual information is indexed by key objects, so users can view contextual information near the key objects and control the devices that are designated as key objects.

The second issue is the timing at which the system captures contextual information. We found three approaches to this problem. In the first approach, the system continuously captures contextual information about a user, and searches for an appropriate context later on, when the user wants to refer to a particular piece of information. This approach causes several problems, such as scalability, because much disk space is necessary, and the difficulty of finding appropriate information in the massive amount of stored data. The second approach is to make the system decide on the timing. This approach causes problems similar to those that arise when the system automatically decides on the target area. The third approach, which is the one we chose, is to make the user decide on the timing. The system interprets the action of taking the photo as the timing decision, so users can intuitively decide when to capture.

To summarize the discussion of the target area and timing of capture, our approach, which uses the action of taking a photo as a metaphor, is a reasonable method for dealing with contextual information.

Turning now to the method of viewing contextual information, in our approach the information appears on top of the photo image. As described in the scenarios, users can control devices remotely by touching the icon corresponding to the device, and can recognize context such as temperature or brightness by reading the information indicated on the u-Photo.

Contextual Information in u-Photo

Next, we discuss what kind of information u-Photo should treat as contextual information. Previous research has used the term contextual information with several different meanings. Therefore, we first discuss our definition in this research. Roughly speaking, definitions of contextual information range between two extremes: highly integrated and fundamental. Fundamental information can be obtained directly from devices or sensors. This information is usually represented as a numerical value or raw data, depending on the individual sensor or device. Highly integrated contextual information is produced by aggregating and interpreting fundamental information, for example “the user is working” or “the user is talking with another person”. In our research, we focus on the method of capturing contextual information from the real world, so our system deals with individual sensors and devices that provide only simple information about the environment. Dealing with highly integrated contextual information in u-Photo is future work. We assume that fundamental information can be classified into the following three groups:

Device Information: Device information is information on the kinds of devices available in the environment captured in a u-Photo. Device information is needed to indicate and display the available devices on a u-Photo. By clicking the icon displayed in the u-Photo, the user can control the corresponding device, as presented in the scenarios. To create this icon, the u-Photo system needs to recognize the devices available in the environment when the u-Photo is created.

Sensor Information: Sensor information is information from sensors embedded in a room or a device. It is obtained directly from sensors or devices, so the data format depends on the source. For u-Photo to be usable in various environments, both an abstracted interface for handling data and representation-dependent interfaces need to be provided.

Task Information: Task information is information on the tasks that users are performing when taking a u-Photo. This information contains the execution status of the devices within a u-Photo. Using the stored information, a user can resume the task in another environment.

In the next section, we describe the design of mechanisms for managing this information in detail.

as device, PCs and so on. In the scenario, the room light and the air conditioner correspond to key objects.

DESIGN

System Overview

In the previous section, we discussed design issues that should be addressed to develop a practical system for u-Photo. The components and the relations between them are shown in Figure 4. Most components belong to the u-Photo Creator, which is the software for creating u-Photos. Each component gathers the information needed for creating a u-Photo. When a user takes a u-Photo, the u-Photo system should obtain the location of each object in the focus area, because contextual information must be mapped to the appropriate location on the image. Therefore, it is necessary to provide a mechanism in the u-Photo Creator that recognizes key objects by image processing.
Figure 4: Design of u-Photo Creator

Device and Task Information Management

We classify the devices treated by u-Photo into two types. The first type deals with media-related data, such as televisions and speakers. The second type is a simple controllable device, such as a light or an air conditioner.

We adopt the Wapplet framework[5] to handle the first type of device, namely devices that deal with media data. In the Wapplet framework, devices are abstracted as a “Service Provider”, which is middleware running on each individual device. A Service Provider has an interface for each media type that the device can handle. For example, a television can handle two media types, video and sound output, so the Service Provider of a television has two interfaces, one for each media type. We define four media types: video, audio, text and image. An interface for controlling devices is designed for each media type. If two devices provide the same interface, users can use them interchangeably even if the actual devices are different. When a user moves to another environment after taking a u-Photo, there is usually no guarantee that the same devices are available. In the Wapplet framework, not only the interfaces but also the format of the stored information is unified for each media type, so that suspended tasks can migrate between devices using the stored information.

The second type of device provides only a simple interface for controlling itself, such as the air conditioner shown in the scenario. Devices of this type can be controlled through a command-based interface with commands such as “on” and “off”. In u-Photo Creator, these devices provide a description of a GUI, written in XML, for controlling themselves. Each description corresponds to one command that triggers an action of the device. A sample description of a button that provides a “light on” function is shown in Figure 6.

Description Format of Contextual Information

u-Photo Creator needs to store the information gathered from each component in an image file. Since the description of information written in a u-Photo should be readable and well formatted, we chose XML for describing the information obtained from each component. A sample XML description is shown in Figure 5.

    <?xml version="1.0" encoding="shift_jis" ?>
    <u_photo xsize="640" ysize="480">
      <timestamp>Tue Jan 13 03:48:05 JST 2004</timestamp>
      <location>Keio University SFC SSLab</location>
      <devices>
        <device id="1" name="PSPrinter">
          <coordinate><x>51</x><y>161</y></coordinate>
          <wapplet name="PSPrinter">
            <media_type>text</media_type>
            <status>100000</status>
            <time>0</time>
            <service_provider>PrinterProvider</service_provider>
            <ip>dhcp120.ht.sfc.keio.ac.jp</ip>
          </wapplet>
          <sensors></sensors>
        </device>
        <device id="2" name="CDPlayer">
          <coordinate><x>349</x><y>343</y></coordinate>
          <wapplet name="CDPlayer">
            <media_type>audio</media_type>
            <status>100000</status>
            <time>0</time>
            <service_provider>AudioProvider</service_provider>
            <ip>dhcp120.ht.sfc.keio.ac.jp</ip>
          </wapplet>
          <sensors></sensors>
        </device>
        <device id="3" name="ColorProinter">
          <coordinate><x>542</x><y>308</y></coordinate>
          <wapplet name="ColorProinter">
            <media_type>text</media_type>
            <status>100000</status>
            <time>0</time>
            <service_provider>PrinterProvider</service_provider>
            <ip>dhcp120.ht.sfc.keio.ac.jp</ip>
          </wapplet>
          <sensors></sensors>
        </device>
      </devices>
      <sensors></sensors>
    </u_photo>

Figure 5: Sample XML Format

Object Recognizer

As previously described, users decide the target scope by looking through the finder of a camera. Several key objects may be contained in a single u-Photo. A key object is a landmark for deciding the area whose contextual information a user wants to capture as a u-Photo.
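As an illustration of how a viewer might read the description of Figure 5 back in order to place icons over key objects, the following minimal sketch parses a shortened copy of that XML with Python's standard library. This is an assumption-laden sketch, not the authors' implementation (which is written in Java): the `KeyObject` type and `read_key_objects` function are hypothetical names, and the sample omits the XML declaration and most fields for brevity.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Shortened copy of the Figure 5 description (declaration and some fields omitted).
SAMPLE = """
<u_photo xsize="640" ysize="480">
  <devices>
    <device id="1" name="PSPrinter">
      <coordinate><x>51</x><y>161</y></coordinate>
      <wapplet name="PSPrinter">
        <media_type>text</media_type>
        <service_provider>PrinterProvider</service_provider>
      </wapplet>
    </device>
    <device id="2" name="CDPlayer">
      <coordinate><x>349</x><y>343</y></coordinate>
      <wapplet name="CDPlayer">
        <media_type>audio</media_type>
        <service_provider>AudioProvider</service_provider>
      </wapplet>
    </device>
  </devices>
</u_photo>
"""


@dataclass
class KeyObject:
    """Hypothetical in-memory form of one <device> entry."""
    name: str
    x: int            # pixel position of the key object in the photo
    y: int
    media_type: str   # media type handled by the device's Service Provider


def read_key_objects(xml_text: str) -> list[KeyObject]:
    """Extract name, image coordinates and media type of every device."""
    root = ET.fromstring(xml_text.strip())
    return [
        KeyObject(
            name=device.get("name"),
            x=int(device.findtext("coordinate/x")),
            y=int(device.findtext("coordinate/y")),
            media_type=device.findtext("wapplet/media_type"),
        )
        for device in root.iter("device")
    ]


for obj in read_key_objects(SAMPLE):
    print(f"{obj.name}: icon at ({obj.x}, {obj.y}), media type {obj.media_type}")
```

The coordinates are what allow a viewer to draw each control icon on top of the corresponding object in the picture.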
An object that can meet requirements of the key object is visible and intelligent such

    <button name="ON">
      <ip>131.113.209.87</ip>
      <port>34567</port>
      <command>LIGHT_ON</command>
    </button>

Figure 6: Sample Description of Button

Both types of devices should also reply to requests from u-Photo Creator for their service state or command list.

Sensor Information Management

When a user takes a u-Photo, an area is decided by the actions of “viewing the finder” and “releasing the shutter”. The environmental information stored in the u-Photo must cover the area that the user wants to capture. Therefore, u-Photo needs a sensor system that can represent various areas at a higher level of abstraction. For example, in the scenario, the sensor system needs to represent areas such as “around the object” or “in the scope of the finder”. With these abstract representations of areas, u-Photo can provide intuitive environmental information to the user.

MARS is a sensor system that accepts requests from applications that specify a sensing area, and provides the data acquired from the sensors in that area. MARS supports the following representations of an area:

“In the scope of the finder”
“Around an object”

This enables applications to acquire sensor data for the target area without having to deal with individual sensors. In MARS, every sensor has its own meta-information, which it reports to MARS. On the basis of the meta-information, MARS determines whether the sensor data is associated with the area specified by the application. MARS defines the meta-information listed in Table 1.

    meta-information                      examples
    The area of sensor existence          roomname
    The object of attaching the sensor    chair, bed, user
    Location information of the sensor    (x,y,z)
    The types of the sensor               temperature, humidity
    The format of the sensor data         8byte, 0.1/sec

Table 1: meta-information of sensors

MARS has a database that manages the meta-information of the sensors. When an application requests data for an area, MARS searches the database for meta-information associated with the request. If some meta-information matches the request, MARS provides the corresponding sensor data to the application. For example, when a room and a display appear in a u-Photo, u-Photo Creator specifies its area as “in the roomA” and “around the displayB”. MARS searches for meta-information containing “roomA” or “displayB”. If MARS finds any, the sensors that have this meta-information provide their data through MARS.

IMPLEMENTATION

Currently, our implementation is organized as shown in Figure 7. We assume that a PDA can be used as a digital camera; however, image processing on the PDA is difficult due to its limited computational power. Therefore, the components for capturing images and detecting objects are separated from the PDA in our prototype implementation. However, users can determine the scope of a u-Photo by using the PDA screen as a finder and can also release the shutter on the PDA. As a result of this action of taking a photo, u-Photo Creator creates a u-Photo and sends it to the u-Photo Viewer running on the same PDA.

Figure 7: System Architecture

u-Photo Creator

To recognize key objects, LED Tags are attached to the corresponding objects (shown in Figure 8). The LED Tag Recognizer captures images from a USB camera and processes them to detect the location of objects in the image. Each LED Tag lights up in a different color, which represents the ID of the object. The LED Tag Recognizer looks up the directory service to obtain information about the object for use by the Device/Task Information Manager. Using the result of the lookup, u-Photo Creator checks the device status and obtains the list of commands for controlling the device.

Figure 8: LED Tag on the light device

All parts of u-Photo Creator are written in Java. For image processing, we used JMF 2.1.1e (Java Media Framework). We use EXIF[2] to embed all of the information from each component into a JPEG image.

MARS

MARS was implemented on Linux 2.4.18-0vl3 using J2SDK 1.4.1 and PostgreSQL 7.3.4. In this implementation, mica2[4], developed by UC Berkeley, was used as the sensor, and temperature and brightness were acquired. When a mica2 starts, it reports its meta-information to MARS, and MARS registers the meta-information in a database. MARS can then provide sensor data in response to requests from u-Photo.

u-Photo Viewer

u-Photo Viewer was implemented in Java on a PDA, the Zaurus SL-860. The current u-Photo Viewer provides the following functions presented in the scenarios: controlling devices, viewing environmental information, and resuming the task stored in the u-Photo.

RELATED WORK

There has been similar research on capturing contextual information. NaviCam[6] displays situation-sensitive information by superimposing messages on its video see-through displays using PDAs, head mounted displays, CCD cameras and color-code IDs. InfoScope[3] is also an information augmentation system that uses a camera and a PDA’s display without attaching any tags to objects. When users point their PDA at buildings or places, the system displays the name of the place or the stores in the building on the PDA. DigiScope[1] annotates images using a visual see-through tablet. In this system, the user can interact with embedded information related to a target object by pointing to the object. Although these systems are similar to u-Photo in terms of annotating images, they focus on real-time use, in which the users interact with a target object currently in front of them. We concentrate on recording contextual information and reusing it in a different environment.

Truong et al.[7] have developed applications in which tasks are recorded as streams of information that flow through time. Classroom 2000, one of their applications, captures a fixed view of the classroom, the lecture and other web accessible media the lecturer may want to present. In this approach, what to record and when to record streams depend on each application. In addition, since the tasks they target are never executed again, every state of the task needs to be recorded as a stream. In contrast, the tasks we target are reproducible, because we only record in the digital photo the status of tasks at the moment the user releases the shutter.

Several products already record contextual information in digital photos. Digital cameras record camera status (e.g., focal length, zoom, and flash), and cellular phones record global positioning system (GPS) information. However, present products and photo formats do not provide methods for noting the status of tasks or for using photos as user interfaces to a target object in the photograph.

CONCLUSION

In this paper, we presented u-Photo, which provides an intuitive method for capturing contextual information. With this snapshot based method, users can easily determine the target and timing of capturing contextual information as desired. By using u-Photo, users can view contextual information, easily control devices and suspend/resume their tasks. In our current implementation, we used mica2 as the sensor, a Zaurus as the PDA for the camera, and several devices for controlling. Using this implementation, we realized the applications presented in the scenarios.

REFERENCES

1. Alois Ferscha and Markus Keller. Digiscope: An invisible worlds window.
In Adjunct Proceedings of The Fifth International Conference on Ubiquitous Computing, pages 261–262. ACM, 2003.
2. Exchangeable Image File Format. http://www.exif.org.
3. Ismail Haritaoglu. Infoscope: Link from real world to digital information space. In Proceedings of the 3rd International Conference on Ubiquitous Computing, pages 247–255. Springer-Verlag, 2001.
4. Jason Hill and David Culler. A wireless embedded sensor architecture for system-level optimization. Technical report, U.C. Berkeley, 2001.
5. Takeshi Iwamoto, Nobuhiko Nishio, and Hideyuki Tokuda. Wapplet: A media access framework for wearable applications. In Proceedings of International Conference on Information Networking, volume II, pages 5D4.1–5D4.11, 2002.
6. Jun Rekimoto and Katashi Nagao. The world through the computer: Computer augmented interaction with real world. In Proceedings of Symposium on User Interface Software and Technology, pages 29–36. ACM, 1995.
7. Khai N. Truong, Gregory D. Abowd, and Jason A. Brotherton. Who, what, when, where, how: Design issues of capture & access applications. In Proceedings of the 3rd International Conference on Ubiquitous Computing, pages 209–224. Springer-Verlag, 2001.

The Re: living Map - an effective experience with GPS tracking and photographs

Yoshimasa Niwa*, Takafumi Iwai*, Yuichiro Haraguchi**, Masa Inakage*
Keio University
*Faculty of environmental information
**Graduate school of media and governance
{niw, takafumi, hrgci, inakage}@imgl.sfc.keio.ac.jp

ABSTRACT

This paper proposes an application, The Re: Living Map, which provides an effective city experience using a mobile phone, GPS tracking and photographs, and describes a new method for constructing the system, named “gpsfred”.

Keywords

GPS, Tracking, Mobile Phone, Information Design

INTRODUCTION

The proliferation of network-connected, GPS-enabled mobile phones has allowed people to utilize positional information via GPS.
While accessing information with mobile phones is becoming routine, the use of GPS information has so far been limited to navigation. However, as it is possible to obtain GPS information anywhere while connected to the network, we can expect various other practical uses to emerge in the future. The functions of mobile phones as digital cameras are evolving rapidly as well. There are also some applications that enable communication through the use of GPS information [3, 6]. One such example, The Living Map (Figure 1), comes from our previous research. It is an online community tool that enables users to exchange information about the city using a mobile phone. In that research, we used a city map as the interface for exchanging city information and creating network communities based on people’s interests. Although these applications utilize GPS and/or photographs, to our knowledge no previous application has been built on GPS tracking. We propose three phases that allow us to effectively experience the city with GPS tracking and photographs: Packaging, Reliving, and Sharing.

Against this background, we propose an application that allows users to effectively experience (relive) the city from a new perspective using photographs and the GPS tracking system named gpsfred. The project is based on our previous research, The Living Map [3].

BACKGROUND

By linking photographs and transforming them in a fashion similar to Hayahito Tanaka's Photo Walker [4], we can create a virtual three-dimensional space that users are able to experience vicariously. Noriyuki Ueda's GIS with cellular phone+WebGIS [8] also creates a virtual city by relating photographs and GPS position data.

Figure 1: The Living Map

EFFECTIVE EXPERIENCE WITH GPS TRACKING

We propose three phases to realize effective city experiences.
In the city, users can package their own experiences by taking photographs and using the GPS tracking system embedded in their mobile phones (Packaging). After returning home, they use our proposed application to relive their experiences (Reliving) and, finally, to share their city experiences with others and experience those of others online (Sharing).

1. Packaging

In the Packaging phase, users are able to take photographs any time they like. These photographs, while discrete in sequence, reflect what the users were strongly interested in (Figure 2). In contrast, the information obtained through GPS tracking is continuous because tracking is always enabled in this phase, and it provides attributes common to every user: position and time. For these reasons, the combination of GPS tracking and photographs allows users to package their experiences for sharing.

Figure 2: Experience package (labels: real time and real place; take photographs; GPS tracking; packaging phase; experience package; show strong impression)

2. Reliving

In the Reliving phase, The Re: Living Map gives users a richer city experience than a collection of still photographs can give. This is accomplished by automatically adding effects to the photographs and playing them back at intervals proportional to the actual time intervals at which the original pictures were taken. The effects are calculated from the GPS tracking data (Figure 3 and Figure 4).

Figure 3: Adding effects to photographs to enhance the reliving (panels: turn right; turn left; go straight; go back)

The effects are created in accordance with the users' actions. For example, if the tracking data shows that the user turned right, we will see the next photograph push the previous photograph off the screen.
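The text does not state how a turn is derived from the tracking data. One plausible way, shown here purely as a sketch, is to compare the compass bearing of two consecutive track segments and map the change of heading to the four actions named in Figure 3. The function names and the 45-degree threshold are assumptions made for illustration and are not taken from gpsfred.

```python
import math


def bearing(p, q):
    """Initial compass bearing in degrees (0 = north, 90 = east) from p to q.

    Points are (latitude, longitude) pairs in degrees.
    """
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    dlon = lon2 - lon1
    y = math.sin(dlon) * math.cos(lat2)
    x = math.cos(lat1) * math.sin(lat2) - math.sin(lat1) * math.cos(lat2) * math.cos(dlon)
    return math.degrees(math.atan2(y, x)) % 360.0


def classify_turn(a, b, c, threshold=45.0):
    """Classify the movement a -> b -> c by the change of heading at b.

    The threshold (in degrees) is an assumed tuning parameter.
    """
    # Signed heading change, normalised to the range [-180, 180).
    change = (bearing(b, c) - bearing(a, b) + 180.0) % 360.0 - 180.0
    if abs(change) >= 180.0 - threshold:
        return "go back"
    if change >= threshold:
        return "turn right"
    if change <= -threshold:
        return "turn left"
    return "go straight"


# Three GPS fixes: heading north, then heading east.
start, corner, end = (35.0, 139.0), (35.001, 139.0), (35.001, 139.001)
print(classify_turn(start, corner, end))
```

A viewer could then pick the transition effect from the returned label, for example letting the next photograph push the previous one off the screen for a right turn.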
Playing them back at intervals proportional to real time prompts the users' memories to fill in the gaps between them. Through these effects, users are able to effectively relive their activities in the city.

Figure 4: Automatic direction detection

3. Sharing

In the Sharing phase, we propose a method for sharing experiences named "Intersect". With Intersect, the intersections between the GPS tracking data of different users act as starting points for the sharing of experiences. If the owner of an intersecting experience has allowed others access to it, users are able to experience it themselves, using the intersections as their entryways (Figure 5). The Intersect method provides users with a new type of experience, attainable only through the sharing of individual experiences via digital means.

Figure 5: Shared experience using “Intersect”

With The Re: Living Map, we propose an online application that implements these three phases with GPS tracking and photographs to let users (re)live their city experiences.

APPLICATION

In this section, we propose an application, The Re: Living Map, which provides effective city experiences across the aforementioned three phases. This application is implemented with gpsfred, an application framework we designed. We will first give an overview of The Re: Living Map, and then go into the details of each of its features.

Overview

The Re: Living Map consists of two interfaces, one for the mobile phone and one for the personal computer (Figure 6 and Figure 7). The interface for the mobile phone is mainly for packaging the users' own experiences. The interface for the personal computer is for reliving and sharing their experiences.

Figure 6: Application interface for mobile phone
Figure 7: Application interface for the PC

Package experience on the mobile phone

The interface for the mobile phone has two functions: GPS tracking and taking photographs (Figure 6). Users may switch the GPS tracking on or off and take photographs anytime they wish. The photographs and GPS tracking data, which constitute the users' experiences, are stored in the users' database via the Internet.

Relive experiences on the personal computer

The interface for the personal computer has three views: the Photo View, the Intersect View, and the Package View (Figure 7). In the bottom view, the Package View, users can select their own experiences that have been stored from the mobile phone. In the middle view, the Intersect View, users are able to select a path from the experience selected in the bottom view. Finally, the top view, the Photo View, shows photographs with effects that match the users' activities, which are automatically calculated from the selected path.

Photo view

Figure 8 shows the Photo View. This view usually plays back photographs enhanced with effects at time intervals proportional to real time. Users are also able to experience the paths using a mouse. In this view, the mouse cursor changes shape to tell users where they have been and where they went. Users will come to see the cursor as a representation of themselves, and the photo effects tell them the positional relationship of the photographs. Combined, these effects prompt the users' memories to fill in the blanks between the photographs and to effectively relive their experiences of the city.

Figure 8: Photo View

Intersect view and intersect method

Users are able to select a tracking path to show in the Photo View. Users are also able to choose whether the selected path should accept Intersects from other users. Should the user allow it, the application sends the information about the selected path to the application server, which is shared by all users. The server checks the other paths and broadcasts the detected intersections, together with the related path and photograph information, as a shared experience. If the path has intersections and a user selects an intersection in the view, photographs from both experiences are presented in the Photo View (Figure 9).

Figure 9: Intersect View

GPS TRACKING FRAMEWORK DESIGN

Although there are many applications which utilize GPS, to our knowledge no previous application has been built on GPS tracking as its basis. In this section, we describe a new application framework for GPS tracking.

The framework has four layers: the network layer, the hardware layer, the middleware layer, and the application layer. These layers make the experience of the users more effective. The most important components are gpsfred on the middleware layer and the application on the application layer (Figure 10).

Figure 10: Application framework for GPS tracking

Features of gpsfred

With gpsfred, we have implemented a middleware which makes it easy to implement GPS tracking on mobile phones and applications that use it. Developers who plan to use gpsfred to implement applications are able to extend any functionality of gpsfred to suit their needs through plug-ins. Figure 11 shows the five features of gpsfred.

Figure 11: Features of gpsfred

Generate tracking data - 1
The first step is to detect the current position repeatedly, generating the tracking data. Developers are able to change the repetition interval in units of 1/100 second.

Join photographs and tracking data - 2
A photograph taken while the GPS tracking is running is associated with the tracking data and stored. This adds chronological information to the photographs, allowing developers to handle them in that order.

Store into a database - 3
All data are stored in a database accessed via a network. For that reason, developers are able to handle both the tracking data and the photographs flexibly.

Reuse tracking data and photographs - 4
gpsfred provides the tracking and photograph data in an XML format. Developers are then able to perform operations such as converting the axes of the tracking data, normalizing them, resizing the photographs to fit, and so on.

Support plug-ins - 5
Plug-ins are supported at every level of gpsfred. Developers are able to extend any functionality of gpsfred. In fact, some of the methods gpsfred provides by default are themselves implemented as plug-ins.

FUTURE WORK

In this research, the primary issue for the future is the operation and evaluation of the application. Currently, several problems remain before large-scale testing can be attempted. One such problem is the phone bills generated by use of the program. Solutions such as fixed-rate communication services are on the horizon, however.

Framework and application

The current GPS tracking method in gpsfred has a problem in that position detection takes at least approximately 15 seconds. We have recognized this as an implementation problem and will modify the implementation to avoid it.

REFERENCES

1. Fujihata M. Ikedorareta Sokudo. ICC Gallery, 1994.
2. Fujihata M., and Kawashima T. Field-Work@Alsace. 2002.
3. Haraguchi Y., and Shinohara T., and Niwa Y., and Iguchi K., and Ishibashi S., and Inakage M.
“The Living Map - A communication tool that connects real world and online community by using a map.” Journal of the Asian Design International Conference Vol.1, Oct., 2003, K-49.
4. Photo Walker. Available at http://www.photowalker.net/
5. Sasaki M. Affordance – Atarashi Ninchi no Riron. Iwanami Shoten, 1994.
6. Takahashi K., and Tsuji T., and Nakanishi Y., and Ohyama M., and Hakozaki K. "iCAMS: Mobile Communication Tool using Location Information and Schedule Information." School of Information Environment, Tokyo Denki University, 2003.
7. Tsuchiya J., and Tsuji H. GPS Sokuryo no Kiso. Nihon Sokuryo Kyokai, 1999, pp.57-80.
8. Ueda N., and Nakanishi Y., and Manabe R., and Motoe M., and Matsukawa S. "GIS with cellular phone + WebGIS - Construction of WebGIS using the GPS camera cellular phone." The Institute of Electronics, Information and Communication Engineers, 2003.

Relational Analysis among Experiences and Real World Objects in the Ubiquitous Memories Environment

Tatsuyuki KAWAMURA, Takahiro UEOKA, Yasuyuki KONO, Masatsugu KIDODE
Graduate School of Information Science, Nara Institute of Science and Technology (NAIST)
Keihanna Science City, 630-0192 Nara, Japan
{kawamura, taka-ue, kono, kidode}@is.naist.jp

ABSTRACT

This paper introduces a plan for an experiment to analyze relations among user experiences and real world objects in the Ubiquitous Memories system. By finding out the characteristics of these relations, we could develop functions for automatically linking an experience with objects, and for recommending experiences linked with several objects. In order to conduct this experiment, we attached 2,257 RFID tags to real world objects. We also categorized the objects into 21 purpose-based object types, and investigated the share-ability of experiences by a questionnaire in advance. Both pieces of work are important for focusing the analysis parameters in the experiment. In this paper we present basic results of the categorization of the tagged objects and of the questionnaire.
Author Keywords

Ubiquitous Memories, Real World Object, Augmented Memory, Wearable Computer.

INTRODUCTION

Computational augmentation of human memory has been studied extensively in recent years. Rhodes termed this augmentation of human memory “augmented memory” [1]. In particular, technology for the “sharing of experiences” in everyday life attracts researchers, because we would gain richer knowledge if such technology were realized. It would offer solutions to difficult problems that we rarely experience and do not know how to overcome. Researchers, however, do not yet know which support techniques exist for collaboration among people, nor which support techniques should be implemented to realize the sharing of experiences in the real world. We believe that this is currently the most important issue in accomplishing the sharing of experiences in everyday life.

We have been working on the Ubiquitous Memories project since fall 1999. The overall aim of the project is to realize a digital nostalgia, which would be created as an autobiographic history by linking human experiences with real world objects. The project first proposed its concept in 2001 [2], and implemented a prototype system that can operate in everyday life in 2002 [3]. We have conducted experiments to evaluate the performance of Ubiquitous Memories under the stand-alone user condition [4]. In 2004, we are conducting an experiment to analyze relations among users' experiences and real world objects. This paper mainly introduces the plan of the experiment.

We are planning to conduct a long-term experiment in the real world. The aim of this experiment is to investigate the relations between objects and the experiences linked with them. The experiment would also give us characteristics of the object-object relations.
By finding out these characteristics, we could develop functions for automatically linking experiences with objects, and for recommending experiences linked with several objects.

For the experiment, we attach 2,257 RFID tags to real world objects. We then categorize the objects into 116 classes distinguished by functional attributes, and into 21 purpose-based (role-based) categories that contain the classes. The categorization of the objects will be used to clarify the identity of the objects by analyzing the video data sets linked with the objects in each category. We also investigate the share-ability of experiences through a questionnaire consisting of four questions. The questionnaire gives us a direction for finding out, before we conduct the experiment, which objects would be useful for analyzing logs. Furthermore, with 2,257 objects, huge video data sets, and operation logs, we do not yet know how to discover the mechanisms of the relations among people, objects, and experiences.

UBIQUITOUS MEMORIES

We have proposed a conceptual design for ideally and naturally bridging the space between augmented memory and human memory by regarding each real world object as an augmented memory archive. To seamlessly integrate human experience and augmented memory, we consider it important to provide users with natural actions for storing and retrieving augmented memories. The human hand plays an important role in integrating augmented memory into objects. In our design, the human body is used as the medium both for perceiving the current context (event) as a memory and for propagating the memory to an object; that is, the memory travels all over the body like electricity and runs out of one of the hands.

Figure 1. The Ubiquitous Memories Equipment

The terms of the latest version of the conceptual actions [5] are defined as follows:

• Enclose is an action performed in two steps. 1) A person implicitly/explicitly gathers current context through his/her own body.
2) He/She then arranges the contexts as ubiquitous augmented memory with a real world object using a touching operation.

• Accumulate denotes a situation where augmented memories are enclosed in an object. Functionally, this means that the augmented memories are stored in computational storage somewhere on the Internet with links to the object.

• Disclose is a reproduction action in which a person recalls the context enclosed in an object. Disclosure has a meaning similar to replaying media data.

Figure 2. The Location of the Experiment

Equipment

Figure 1 depicts the equipment worn by a user of Ubiquitous Memories. The user wears a head-mounted display (HMD; SHIMADZU, DataGlass2) to view augmented memories (video data) and a wearable camera (KURODA OPTRONICS, CCN-2712YS) to capture video data from his/her viewpoint. The user also wears a Radio Frequency Identification (RFID; OMRON, Type-V720) tag reader/writer on his/her wrist. Additionally, the wearer uses a VAIO jog remote controller (SONY, PCGA-JRH1). In order to control the system, the wearer attaches RFID operation tags to the opposite side of the wrist from the RFID tag reader/writer. The wearer carries a wearable computer on his/her hip. The RFID device reads an RFID tag's data immediately when the device comes close to the tag. The entire system connects to the World Wide Web via a wireless LAN.

Figure 3. The Seating Chart

System Operations

The Ubiquitous Memories system has five operational modes: ENCLOSE, DISCLOSE, MOVE, COPY, and DELETE. There are two basic operation tags and three additional operation tags for changing the mode. The user can select one of the following types:

ENCLOSE: By touching the “Enclose” tag and an object sequentially, the wearer encloses augmented memory to an object.

DISCLOSE: The user can disclose an augmented memory from a certain real world object.
Using the additional operation tags, “DELETE,” “MOVE,” and “COPY,” the user can operate on an augmented memory in the real world in a way similar to operating on files in a PC.

A SUBSTANTIATION EXPERIMENT PROGRAM

Purpose

The aim of this experiment is to investigate relations among experiences and the objects linked with them. The experiment could also give us characteristics of the object-object relations. By finding out these characteristics, we could develop functions for automatically linking an experience with objects, and for recommending experiences linked with several objects. In order to achieve this aim, we are gathering data on linking, rearranging, and referring behaviors with the objects to which RFID tags are attached.

Subjects and Locations

This experiment is being conducted at the Nara Institute of Science and Technology (NAIST) in Nara, Japan, with three graduate students of the Information Science Department. They work in the same laboratory, belong to the same research group, and know each other well. Subject1 is the eldest of the subjects and Subject3 is the youngest. All subjects are Ph.D. course students.

Figure 2 illustrates the environment of the experiment. The location is the building of the Graduate School of Information Science at NAIST, on the 7th floor of Building B, and is composed of rooms B708 and B711 through B715. In the experiment, we labeled the area of B711 through B715 “room A” and the area of B708 “room B.” Additionally, the hallway connecting room A and room B is included in the experiment. Figure 3 shows the detailed layout of room A. The figure also shows the desks the subjects usually use.

Target Real World Objects

This section presents basic information about the 2,257 RFID tags attached to real world objects for the experiment. Subject1 has 538 belongings. Subject2 and Subject3 have 170 and 93 belongings, respectively. There are 561 shared objects and 895 objects belonging to others. In general, one tag is attached to each object, although multiple tags are attached to an object whose elements can themselves be regarded as objects; e.g., a tag is attached to each bay of a bookcase. Table 1 (see the last page of this paper) lists the 116 classes of objects and the 21 categories. Note that a “Class” is a set of objects distinguished by a functional attribute, and a “Category” is a role-based set containing classes. Additionally, PC peripheral Type-A contains “keyboard,” “mouse,” and “display,” and PC peripheral Type-B covers the other devices, which are not used daily. The class columns in Table 1 show the number of objects and their “Mobility,” which indicates how often an object is moved from one point to another (1: never ~ 5: often). The category columns show the ratio of the number of objects and the average mobility. The subjects are allowed to attach an additional RFID tag to an object whenever they find an object with which they want to link an experience.

Questionnaire survey for Share-ability of Experiences

We investigated the share-ability of experiences among users by a questionnaire. In order to accomplish the experiment, we should know in advance how these subjects usually use real world objects and how they would employ the objects to link experiences with them. The three test subjects answered the questionnaire. They answered the following questions on a 5-level scale for each of the 2,257 objects to which RFID tags are attached.

Q1: How is the object shared? (1: individual ~ 5: shared)
Q2: How often do you use the object? (1: never ~ 5: often)
Q3: How many experiences will you link with the object? (1: nothing ~ 5: a lot)
Q4: How many experiences linked with an object will be shared? (1: nothing ~ 5: a lot)

Score (Avg)   1.0     -2.0    -3.0    -4.0    -5.0
Rate (%)      48.74   20.07   5.01    4.25    21.93
Table 2. Average Scores and Score Ratio in Q1

Table 2 shows the average scores and the ratio of each score range in Q1. Here, “-2.0” means that 1 < Avg Score ≤ 2. The rest of the notation follows the same pattern: “-3.0” means that 2 < Avg Score ≤ 3, “-4.0” means that 3 < Avg Score ≤ 4, and “-5.0” means that 4 < Avg Score ≤ 5. The result shows that the objects to which we attached RFID tags comprise 68.81% individual objects (1.0 and -2.0) and 26.18% highly shared objects (-4.0 and -5.0). This means that we can investigate relations among experiences and objects widely across the score ranges.

Score (Avg)   1.0     -2.0    -3.0    -4.0    -5.0
Rate (%)      23.84   50.02   17.28   7.00    1.86
Table 3. Average Scores and Score Ratio in Q2

Table 3 shows the average scores and the ratio of each score range in Q2. The notation of the scores in Table 3 is the same as in Table 2. This result shows that nobody uses 23.84% (1.0) of the objects, nor would anybody use them in the experiment. However, 8.86% of the objects (-4.0 and -5.0) would be used often. We should investigate the relations among experiences and this type of object, because these objects would be more useful than other objects for a user to rearrange his/her experiences in everyday life. On the other hand, the 67.30% of the objects in the -2.0 and -3.0 ranges are the next most important matter, because fewer opportunities to use them would make it easier to forget the experiences linked with them.

Pattern   Number   Ratio (%)
Low       1,007    44.62
S1        529      23.44
S2        257      11.39
S3        103      4.56
S12       159      7.05
S13       11       0.49
S23       5        0.02
S123      186      8.24
Table 4. Number and Ratio of Group Patterns in Q2

Table 4 shows the number of objects that would be used by the subject(s), and the ratio of the group patterns of subjects who gave a high score (over 3 points). Note that “Low” indicates that nobody gave a high score. “S1,” “S2,” and “S3” mean that only Subject1, Subject2, or Subject3 gave a high score. “S12,” “S13,” and “S23” indicate that the two named subjects gave a high score. “S123” indicates that all subjects gave a high score. Fortunately, all subjects would use 8.24% of the objects. Subject1 and Subject2 would both use a further 7.05% of the objects.

Score (Avg)   1.0     -2.0    -3.0    -4.0    -5.0
Rate (%)      13.56   62.87   21.18   2.39    0.00
Table 5. Average Scores and Score Ratio in Q3

Table 5 shows the average scores and the ratio of each score range in Q3. The notation of the scores in Table 5 is the same as in Table 2. Unfortunately, there were no objects that the subjects would link with a lot of experiences. In contrast, all subjects answered that they never use 13.56% of the objects.

Pattern   Number   Ratio (%)
Low       1,146    50.78
S1        335      14.84
S2        366      16.22
S3        108      4.79
S12       112      4.96
S13       7        0.03
S23       64       2.84
S123      119      5.27
Table 6. Number and Ratio of Group Patterns in Q3

Table 6 shows the number and ratio of the group patterns of subjects who gave a high score in Q3. The labels “Low” through “S123” have the same meaning as in Table 4. In total, two or more subjects would link experiences with the same object for 13.1% of the objects (S12, S13, S23, and S123).

Score (Avg)   1.0     -2.0    -3.0    -4.0    -5.0
Rate (%)      0.18    29.11   45.90   22.29   2.53
Table 7. Average Scores and Score Ratio in Q4

Table 7 shows the average scores and the ratio of each score range in Q4. The notation of the scores in Table 7 is the same as in Table 2. The subjects would use 45.90% of the objects neutrally. Few objects would be employed purely for individual use (0.18%) or purely for sharing experiences (2.53%).

Pattern   Number   Rate (%)
Low       1,431    63.40
S12       586      25.96
S13       27       1.20
S23       0        0.00
S123      213      9.44
Table 8. Number and Ratio of Group Patterns in Q4

Table 8 shows the number of objects that would be used by the subject(s), and the ratio of the group patterns of subjects who gave a high score in Q4. The labels “Low” through “S123” have the same meaning as in Table 4. 63.4% of the objects would not be employed to share experiences. The result indicates that Subject1 and Subject2 would share experiences via 799 objects. Subject3 would share experiences almost exclusively via the objects in pattern S123.

Figure 4. The Object Layout Plan in the Room B
Figure 5. The Object Layout Plan in the Room A

Figure 4 and Figure 5 show the locations of the objects that were given a high score in both Q3 and Q4. In total, 233 objects were selected: 109 objects are in S12, 5 are in S13, and 119 are in S123. These objects would be linked with experiences, and the subjects will employ them for sharing experiences with higher probability than other objects.

            Subject1   Subject2   Subject3
Q1 vs. Q2   0.56       0.62       0.50
Q1 vs. Q3   0.33       0.56       0.54
Q1 vs. Q4   0.47       0.94       0.64
Q2 vs. Q3   0.80       0.82       0.95
Q2 vs. Q4   0.04       0.59       0.90
Q3 vs. Q4   -0.12      0.55       0.94
Table 9. Correlations among the Questions

We computed correlations among Q1, Q2, Q3, and Q4 (see Table 9). The result for Q2 vs. Q3 means that all subjects would link experiences with the objects that they use often. Subject1 has little policy regarding the sharing of experiences when he links experiences with objects (Q2 vs. Q4 and Q3 vs. Q4). Subject2 considers that shared objects would be linked with share-able experiences (Q1 vs. Q4). Subject3 would link experiences with the objects that he often uses, and these experiences would be shared (Q2 vs. Q3, Q2 vs. Q4, and Q3 vs. Q4).
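The score-range labels, group patterns, and correlations reported above can be derived from the raw answers with a few lines of code. The sketch below is illustrative only: the paper does not state which correlation coefficient was used, so Pearson's r is assumed; "over 3-point" is read as strictly greater than 3; and all function names are hypothetical.

```python
from math import sqrt


def score_bin(avg):
    """Map an average score in [1, 5] to the range label used in Tables 2-7."""
    if not 1.0 <= avg <= 5.0:
        raise ValueError("average score must lie in [1, 5]")
    if avg == 1.0:
        return "1.0"
    for upper in (2, 3, 4, 5):
        # "-2.0" covers 1 < avg <= 2, "-3.0" covers 2 < avg <= 3, and so on.
        if avg <= upper:
            return f"-{upper}.0"


def group_pattern(scores, threshold=3):
    """Label which subjects gave a high score, e.g. 'S12'; 'Low' if none did.

    `scores` holds one answer per subject, in subject order. Treating
    "high" as strictly above `threshold` is an assumption.
    """
    high = [str(i) for i, s in enumerate(scores, start=1) if s > threshold]
    return "S" + "".join(high) if high else "Low"


def pearson(xs, ys):
    """Pearson correlation between two equally long answer lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant answers")
    return cov / sqrt(var_x * var_y)
```

Under these assumptions, an object scored 4, 5, and 2 by the three subjects falls in pattern S12, and an entry of Table 9 would correspond to `pearson` applied to one subject's answers to two questions over all 2,257 objects.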
The Plan of Relational Analysis

Parameters

In order to analyze relations among subjects, objects, and experiences in the experiment, we are planning to employ the following five parameters:

1) Operations
• Enclose, Disclose, Move, Copy, Delete
2) Logs
• Time, Referring user, Linking user, Referred object
3) Defined object categories
• Category, Class, Mobility
4) Video contexts
• Linked video, Video that was captured when a user refers to a linked video
5) Questionnaire data
• The four questions presented in this paper

Note that the video contexts will be divided into categories defined by the authors.

Prepared Analyzing Topics

We are mainly employing the following three analysis topics:

• What kinds of video data are linked with a certain object? (We expect that the identity of an object could be computed from the set of contexts of the video data linked with the object.)
• What kinds of purposes does a user have when the user chooses an object to link an experience with? (The system could give the user heterogeneous services depending on the user's purpose.)
• What kinds of objects (or categories) are employed for sharing experiences? (This topic has approximately the same meaning as the second topic.)

We are, however, not sure how many kinds of relations exist, or which kinds of relations would be reliable enough to satisfy a user of the Ubiquitous Memories system. Therefore, we must investigate other relations in the experiment at the same time. Furthermore, we must clarify which kinds of parameters should be employed to find valuable relations in the next stage of the experiment, in order to support the user of the system.

CONCLUSION

We introduced a plan for an experiment to analyze relations among user experiences and real world objects in the Ubiquitous Memories system. We also investigated the share-ability of experiences by a questionnaire in advance.
In order to accomplish the experiment, we should know how these subjects usually use real world objects and how they would employ the objects to link experiences with them. The results of the questionnaire give us a direction for deciding where we should analyze the logs that will be recorded in the experiment.

We are continuing the relation analysis based on the questionnaire. The questionnaire could give us more detailed relations among subjects, objects, and experiences, although this paper described only the basic results in Table 2 through Table 9. For instance, we can analyze the relations at the “class” level shown in Table 1. In addition, we are sure that the operation logs and video data will be huge in the experiment. The storage size of a server holding the logs and video data will exceed 5 TB if all subjects link ten two-minute videos with objects every day for a year. The total number of linked videos would be over 10,000. Discovering the mechanisms of the relations among people, objects, and experiences is difficult for us because of the 2,257 objects, the more than 10,000 videos, and the operation logs. Therefore, we must analyze relations among subjects, objects, and experiences using the questionnaire, both in advance of and in parallel with the experiment.

ACKNOWLEDGMENTS

This research is supported by the Core Research for Evolutional Science and Technology (CREST) Program “Advanced Media Technology for Everyday Living” of the Japan Science and Technology Agency (JST).

REFERENCES

1. Rhodes, B. The Wearable Remembrance Agent: a System for Augmented Memory, Proc. 1st International Symposium on Wearable Computers (ISWC’97), 123-128, 1997.
2. Fukuhara, T., Kawamura, T., Matsumoto, F., Takahashi, T., Terada, K., Matsuzuka, T. and Takeda, H. Ubiquitous Memories: Human Memory Support System Using Physical Objects, Proc. 15th Annual Conference JSAI, 2001. (in Japanese)
3. Kawamura, T., Kono, Y. and Kidode, M.
Wearable Interfaces for Video Diary: towards Memory Retrieval, Exchange, and Transportation. Proc. 6th IEEE International Symposium on Wearable Computers (ISWC2002), 31-38, 2002.
4. Kawamura, T., Fukuhara, T., Takeda, H., Kono, Y. and Kidode, M. Ubiquitous Memories: Wearable Interface for Computational Augmentation of Human Memory based on Real World Objects. Proc. 4th International Conference on Cognitive Science (ICCS2003), 273-278, 2003.
5. Kono, Y., Kawamura, T., Ueoka, T., Murata, S. and Kidode, M. Real World Objects as Media for Augmenting Human Memory, Proc. Workshop on Multi-User and Ubiquitous User Interfaces (MU3I 2004), 37-42, 2004.

A Framework for Personalizing Action History Viewer

Masaki Ito, Jin Nakazawa, Hideyuki Tokuda
niya@ht.sfc.keio.ac.jp, jin@ht.sfc.keio.ac.jp, hxt@ht.sfc.keio.ac.jp
Graduate School of Media and Governance, Keio University
5322, Endo, Fujisawa, Kanagawa, Japan

ABSTRACT

This paper presents a programmable analysis and visualization framework for action histories, called the mPATH framework. In a ubiquitous computing environment, it is possible to infer human activities through various sensors and to accumulate them. Visualization of such human activities is one of the key issues for memory and the sharing of our experiences, since it acts as a memory aid when we recall, talk about, and report what we did in the past. However, current approaches to analysis and visualization are each designed for a specific use, and therefore cannot be applied to diverse uses. Our approach gives users programmability through a visual language environment for analyzing and visualizing action histories. The framework includes icons representing data sources of action histories, analysis filters, and viewers. By composing them, users can create their own action history viewers. We also demonstrate several applications built on the framework. The applications show the flexibility of creating action history viewers on the mPATH framework.
Keywords

Action history, visualization, visual language

INTRODUCTION

In the ubiquitous computing environment, where computers and sensors are embedded in our surroundings, it will be possible to recognize our actions and record them. From the accumulated action history, we will be able to obtain highly abstracted context information about human activity. This information is used to develop context-aware applications, and also to provide us with useful information.

Well-presented action histories help us in daily life, for example in retrieving memories and sharing our experiences. Several applications for representing action histories have also been proposed [3][6][8][9][10][11]. These systems represent the user's activity in order to provide functions such as navigation and indexed action histories. However, these representations are designed for a specific use, and hence users cannot customize them to obtain a personalized view of their action histories. For example, although PEPYS [8] can organize human action histories along the location and time axes, it cannot handle additional information, such as images, related to an action history item. Activity Compass [9] provides a navigation function based on an analysis of location tracks, but it likewise lacks diversity in how the location history is analyzed. Action history viewers, therefore, should provide users with personalized views that enable them to analyze, recall, talk about, and report their action histories.

In this paper, we propose a new framework for creating personalized action history viewers. The framework gives users programmability through a visual language environment for analyzing and visualizing action histories. The framework includes icons representing data sources of action histories, analysis filters, and viewers. By composing them, users can create their own action history viewers.

This paper is organized as follows.
In the next section, we introduce scenarios in which various visualization techniques are used to analyze action history. The third section reviews current techniques for analysis and visualization and clarifies the requirements. In the fourth section, we introduce the mPATH framework as a framework for personalizing action history viewers. The next four sections introduce the usage of the mPATH framework and show several applications. We then evaluate the system and introduce related work. In the final section, we conclude this paper and suggest future work.

SCENARIOS
We introduce two scenarios in which visualization methods are shared and easily developed by users. These scenarios show the use of action history viewers with a personalizing function. In the scenarios, this feature helps communication and a deeper understanding of past activity.

Navigation of a Travel Memory
Last week, Alice traveled to Kyoto, Japan. When she arrived at Kyoto station, she borrowed a PDA with GPS as a sightseeing guide. While she was in Kyoto, the PDA assisted her in planning her trip and gave her guidance at tourist attractions. When she left Kyoto, she returned the PDA and received a small memory card on which her location track data, the names of the tourist attractions she visited, and the photos she took were recorded.

Today she is talking about her trip with her boyfriend, Bob. She inserted the memory card into her PC and showed a map of Kyoto overlaid with lines of location track data. She started talking about her experience from the beginning of her trip, but immediately found that the map was not designed to represent the temporal aspects of her trip. She searched for visualization methods for travel experiences and downloaded them. One method analyzed her travel log and calculated a weight for each tourist attraction she visited from her walking speed and the number of pictures she took.
The method visualized the map of Kyoto with a distortion that stands for the weights, so she could grasp her trip intuitively. She thought shopping was also important in calculating the weights, so she changed the parameters of the analysis and generated a map that closely reflected her impressions. The map gave Bob a smooth understanding of her trip.

Development of Analysis and Visualization Method
Now, Bob wants to go to Kyoto. He asked her to go with him, but she refused because she had just been there. He then decided to show her the attractions of Kyoto that she did not know.

Since he is an amateur programmer, he decided to develop a visualization method in which attractive places that she did not visit are emphasized. He first searched for a web guide to Kyoto in which tourist attractions are ranked and which contains many pictures. Then he developed an algorithm to find attractions she did not visit by comparing the web guide with her tour data. He designed the method to visualize a map of Kyoto with many photos of the attractions. He uploaded his algorithm and asked her to download it and apply it to her memory. She noticed unknown attractions in Kyoto and decided to visit again.

VISUALIZATION OF ACTION HISTORY
In this section, we define action history and review current techniques for the visualization and analysis of action history. Then we clarify the requirements of the visualization system.

In this paper, an action history is an aggregated form of information that contains the location, date and description of an action of a certain person. Location track data obtained by GPS is one example of action history. Digital photo data is also an action history if it contains a time stamp and location information as its meta information.

Current Visualization
There are several visualization methods for representing action history. We introduce some of them and discuss their features. Time and space are basic aspects of action history. We often use them to order action history and as clues for retrieving a certain history. Most visualization techniques are designed to represent both or either of them.

Text-based Visualization
Text-based visualization is a simple technique for representing daily and special experiences. Without machinery, some people keep diaries so as not to forget daily events and, in some situations, to share a secret with an intimate friend by exchanging a diary book. The text-based style has no restriction on format or content, so we can easily represent various experiences in this style. However, it is difficult to find certain information in a diary from a specific point of view.

List
A list is a structured format of text that shows a specific aspect of experiences in order. A chronology is one example of a list, which shows the temporal aspect of history. This style helps intuitive understanding of a specific feature of action history such as time, event, or name of place. PEPYS [8] represents a user's activity as a list in temporal order. We can know the temporal context of each action. However, it is difficult to know the spatial aspect of the action, since the rooms the user was in are represented only as names.

Map-based Visualization
A map, which is widely used to represent geographic information, can also be used to represent the spatial aspect of action history. Overlaying a ready-made map with points and lines that indicate a certain action history is a popular method of representing action history. We can easily understand an action history in a geographic context by using such a map. Map-based visualization is used in several studies [9][11]. However, a ready-made map contains only common objects like restaurants, gas stations and hotels, and is not enough to show personal experiences.

3D Map Visualization
Especially for hikers and climbers who log their tracks with handheld GPS devices, there are several applications in which mountains and valleys are shown as 3D graphics and overlaid with lines of the tracks. KASHMIR 3D [10] is widely used in Japan, and Wissenbach Map3D [3] was also developed for the same purpose. These applications use a digital elevation model of a certain area to create a terrain model.

Photo-based Visualization
Photographs taken by a user represent his or her interests during an activity such as travel. Simply placing many thumbnails of photos on a screen is a widely used technique. However, the spatial and temporal aspects of the photos disappear in this visualization. STAMP [6] represents the spatial relationship of photos by linking the same object in two photos. We can browse photographs by following the links with a mouse. This method simultaneously visualizes the user's points of view and the spatial structure.

Analysis of Action History
As most visualization techniques represent only a few aspects of action history, an analysis method that extracts certain aspects of action history is also important in a visualization process. Currently, analysis and visualization are tightly combined. We discuss them separately in this paper.

By analyzing action history, we can acquire highly abstracted information such as the frequency of visits to a certain place, a daily pattern of movement, and the user's interests and preferences. There are several studies that analyze location track data captured by GPS and extract such information [1][9][12]. Some of these studies use additional geographic information to detect an activity in a certain place.

Requirements
To use action history as an aid to our memory and communication, we should understand various aspects of action history, as the scenarios show. However, current visualization systems are designed for a specific use of action history. To represent action history in its various aspects, flexible analysis and visualization features are required.

For the flexible programming of visualization mentioned in the scenarios, the following features are required of the system.

Flexibility of Data Input
The system must treat the various types of data available in the ubiquitous computing environment simultaneously. Action histories are characterized by their description of location, the content of what we did there, and the way the data was acquired. The system also needs to treat geographic information for the analysis of action history. A digital map and the Yellow Pages with addresses are examples of geographic information.

Providing Programmability of Analysis and Visualization
For all users of the system, it must be easy to create their own visualization method. For skillful users, the system must provide flexible programmability. Even for unskilled users, the system must provide the possibility of changing a visualization method. Sharing existing methods, which are programmed by third parties, also increases the programmability of the system. Using an existing method reduces the cost of creating a new analysis and visualization method.

mPATH FRAMEWORK
We developed a programmable analysis and visualization framework for action history named the mPATH framework. In this section we introduce the features and implementation of the mPATH framework.

Approach
The following are the approaches taken by the mPATH framework to accomplish the aforementioned flexibility.

Component Based Architecture
To reduce the cost of developing a new analysis and visualization method, we divided the method into several components. We defined three types of component: data source, viewer and filter. Users can create a visualization method by combining existing components.

Standardization of Internal Data
We defined a unified type of data in the system. The descriptions of location, date and other information are unified. When action history or geographic information is input, it must be converted into this single type of data. This type of data is used to exchange action history between components. Since the data type is unified, we can easily design a component that handles several kinds of action history. This feature also enables flexible combination of components.

Using Data Flow Style Visual Language
To control the combination of components, we provide a visual programming system in a data flow style. Visual programming reduces the difficulty of creating an original visualization method by combining components.

Implementation
We implemented a prototype of the mPATH framework in Java. Our system is implemented as a GUI application and consists of 14,000 lines of Java. Figure 1 shows a screen shot of the system.

Figure 1: Screen shot of the system

In the current implementation, we mainly focus on realizing visual programming for data analysis and visualization. This version works as a platform for creating analysis methods by data flow style programming. The function for exporting and importing programs is still under construction. We are planning to use XML to exchange them.

Experiment
Since June 2003, Ito, the first author of this paper, has been carrying a "Garmin eTrex Legend" [5], a handheld GPS receiver. Table 1 shows the amount of captured data. During the experiment, he took pictures as action histories using a digital camera. The track data from the handheld GPS and the pictures taken are used as data for the development and evaluation of this system.

Table 1: Captured Data of GPS and Digital Camera

Date        size (byte)   track    point   picture
Jun. 2003       321,310     286    5,893       515
Jul. 2003       294,518     297    5,336       131
Aug. 2003       473,739     412    8,676       218
Sep. 2003       365,855     298    6,730        37
Oct. 2003       307,187     287    5,608       342
Nov. 2003       193,772     153    3,573        54
Dec. 2003       205,621     208    3,727         7
Jan. 2004       278,071     215    5,135       108
Feb. 2004       292,319     231    5,392         5
Total         2,732,392   2,387   50,070     1,417
Average         303,599     265    5,563       157

COMPONENTS FOR mPATH FRAMEWORK
We developed several components for the mPATH framework.
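The approach described for the mPATH framework (three kinds of component that exchange a single unified data type along a data flow) can be pictured with a small Java sketch. This is only an illustration under stated assumptions, not the mPATH source: the names DataFlowSketch, ActionRecord, Source, Filter, Viewer, run, timeWindow and textList are hypothetical, and the record fields simply follow the paper's definition of an action history as location, date and description.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical names throughout; this is not the mPATH source code.
class DataFlowSketch {

    /** Unified record: location, time and a description of the action. */
    static final class ActionRecord {
        final double lat;
        final double lon;
        final long timeMillis;
        final String description;

        ActionRecord(double lat, double lon, long timeMillis, String description) {
            this.lat = lat;
            this.lon = lon;
            this.timeMillis = timeMillis;
            this.description = description;
        }
    }

    /** A data source only produces records. */
    interface Source { List<ActionRecord> read(); }

    /** A filter consumes and produces records of the same unified type. */
    interface Filter { List<ActionRecord> apply(List<ActionRecord> in); }

    /** A viewer only consumes records; here it renders plain text. */
    interface Viewer { String render(List<ActionRecord> in); }

    /** Runs source -> filters (in the given order) -> viewer. */
    static String run(Source source, List<Filter> filters, Viewer viewer) {
        List<ActionRecord> data = source.read();
        for (Filter f : filters) {
            data = f.apply(data);
        }
        return viewer.render(data);
    }

    /** Example filter: keep records with from <= time < to. */
    static Filter timeWindow(final long from, final long to) {
        return in -> {
            List<ActionRecord> out = new ArrayList<>();
            for (ActionRecord r : in) {
                if (r.timeMillis >= from && r.timeMillis < to) {
                    out.add(r);
                }
            }
            return out;
        };
    }

    /** Example viewer: one description per line. */
    static Viewer textList() {
        return in -> {
            StringBuilder sb = new StringBuilder();
            for (ActionRecord r : in) {
                sb.append(r.description).append('\n');
            }
            return sb.toString();
        };
    }

    public static void main(String[] args) {
        // Made-up sample data, only to exercise the pipeline.
        Source source = () -> Arrays.asList(
                new ActionRecord(35.00, 135.76, 1000L, "arrive at station"),
                new ActionRecord(35.01, 135.77, 2000L, "take photo"),
                new ActionRecord(35.02, 135.78, 3000L, "shopping"));
        System.out.print(run(source, Arrays.asList(timeWindow(1500L, 3500L)), textList()));
    }
}
```

Because every stage consumes and produces the same record type, a filter can be inserted or removed without touching the source or the viewer, which is the property the unified data type is meant to provide.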
In this section, we classify them into data sources, filters and viewers, and introduce the components we developed.

Data Source
To input various kinds of action history, we developed several components as data sources. These components access action history and transform the data into the standard data format.

GPS Location Track Data Source
The GPS location track data source accesses files of location track information or a GPS device through an RS-232C interface. The acquired data are transformed into a group of points that the actor passed.

Photo Data Source
The photo data source deals with image files taken by a digital camera. By reading the time stamp and location in the EXIF [7] information of the files, the component generates an action history of "taking pictures" in the standard data format. The component can also accept location track data. By matching the time stamps of photos with the location track data, it estimates the locations of the pictures.

Map Data Source
The map data source inputs map data in a vector format, such as points for stations, lines for roads and polygons for buildings. Users can examine the details of an action history by comparing it with the map data. Since the data was designed for GIS, we can obtain a generic map by rendering it and use the map as a background for visualization.

Filters
Filter components have one or more inputs and outputs. They take data in the standard format as input and output the result of processing in the same format.

Time Filter
The time filter extracts action histories within a specified term. The term can be changed through the GUI of the filter. The operation immediately affects the output on the viewer.

Speed Filter
The speed filter filters data by speed. It is especially useful for inferring the means of transportation from location track data.

Formalize Filter
The formalize filter classifies a group of point data into movements and stops, and outputs them as track data and point data through two separate sockets. Users can change the time threshold used to detect the actor's stops.

Matching Filter
The matching filter has two input sockets: one for map data and the other for a geographic coordinate. The matching filter obtains the name of the specified coordinate from the specified map.

Count Filter
Inside the count filter, the geographical region is divided into a grid. The filter counts the input data in every grid cell. Users can know how many times they visited a certain point in an action history, i.e. the weight of the point.

Viewer
We developed three viewers to visualize action history.

Normal Map Viewer
To visualize the spatial aspect of action history, we developed the normal map viewer. In this viewer, every input is arranged by its geographic coordinates, so that a generic map-like visualization is realized. This viewer accepts multiple data inputs and overlays them. Figure 2 shows an image of the normal map viewer.

Figure 2: Normal Map Viewer

Table Viewer
We used the table viewer shown in Figure 3 to visualize the data output by the matching filter. The left column lists the names of the places the user visited, the center column shows the length of each stay, and the right column shows the date on which the stay began.

Figure 3: Table Viewer

Weight Map Viewer
To represent a highly abstracted context of action, we developed the weight map viewer. The viewer accepts weights of regions in addition to normal action history or geographic information. This viewer visualizes the weight as the color or scale of each region, as shown in Figure 4.

Figure 4: Weight Map

VISUAL PROGRAMMING
In this section we introduce the visual programming style of the mPATH system by creating a map-like viewer of action history. The main window of the mPATH framework consists mainly of two windows: one is a palette and the other is a canvas. Figure 5 shows the detail.

Figure 5: Visual Programming Window
Figure 6: Example of Visual Programming

In the palette, components of data sources, viewers and filters are registered as icons. By clicking two icons on the canvas, two components can be connected and disconnected.
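Three of the components described above lend themselves to a short Java sketch: the time filter, the count filter, and the photo data source's estimate of a picture's location from the track. The code below is a minimal illustration under stated assumptions, not the mPATH implementation. The names FilterSketch, Fix, timeFilter, countGrid and nearestFix are hypothetical, the grid cell is assumed to be a square of cellDeg degrees, the count is simplified to the number of recorded points per cell, and "matching time stamps" is assumed to mean picking the track point nearest in time.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; names and parameters are not taken from mPATH.
class FilterSketch {

    /** A GPS fix: latitude/longitude in degrees, time in milliseconds. */
    static final class Fix {
        final double lat;
        final double lon;
        final long timeMillis;

        Fix(double lat, double lon, long timeMillis) {
            this.lat = lat;
            this.lon = lon;
            this.timeMillis = timeMillis;
        }
    }

    /** Time filter: keep fixes with from <= time <= to. */
    static List<Fix> timeFilter(List<Fix> in, long from, long to) {
        List<Fix> out = new ArrayList<>();
        for (Fix f : in) {
            if (f.timeMillis >= from && f.timeMillis <= to) {
                out.add(f);
            }
        }
        return out;
    }

    /** Count filter: number of fixes per grid cell, keyed as "row:col". */
    static Map<String, Integer> countGrid(List<Fix> in, double cellDeg) {
        Map<String, Integer> counts = new HashMap<>();
        for (Fix f : in) {
            long row = (long) Math.floor(f.lat / cellDeg);
            long col = (long) Math.floor(f.lon / cellDeg);
            counts.merge(row + ":" + col, 1, Integer::sum);
        }
        return counts;
    }

    /** Photo location: the fix whose time stamp is nearest to the photo's. */
    static Fix nearestFix(List<Fix> track, long photoTimeMillis) {
        Fix best = null;
        long bestGap = Long.MAX_VALUE;
        for (Fix f : track) {
            long gap = Math.abs(f.timeMillis - photoTimeMillis);
            if (gap < bestGap) {
                bestGap = gap;
                best = f;
            }
        }
        return best; // null when the track is empty
    }

    public static void main(String[] args) {
        // Made-up coordinates, only to exercise the three functions.
        List<Fix> track = new ArrayList<>();
        track.add(new Fix(35.3882, 139.4272, 0L));
        track.add(new Fix(35.3884, 139.4274, 60_000L));
        track.add(new Fix(35.4005, 139.4505, 120_000L));

        System.out.println(timeFilter(track, 0L, 60_000L).size());
        System.out.println(countGrid(track, 0.01));
        System.out.println(nearestFix(track, 70_000L).timeMillis);
    }
}
```

The per-cell counts returned by countGrid are the kind of value a weight map could draw as a color or scale; a fuller version would presumably count visits rather than raw points.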
By dragging and dropping an icon, a component is copied and registered on the canvas. On the canvas, every data flow is shown as a line running from left to right. Each component icon has data sockets: the right socket is an output of data, and the left socket is an input. Icons with only a right socket are data sources, and icons without a right socket are viewers. Icons with both left and right sockets are analysis components and work as filters of data.

Figure 6-(1) is a simple visualization of GPS location track data. In this program, GPS data are input to a formalize filter. Only the track data are input to a normal map viewer and shown in geographic coordinates. The result of the visualization is Figure 2-(1). When we connect both the track and point sockets to the viewer, we can see stop point data together with track data, as shown in Figure 2-(2). An operation on the canvas is immediately reflected in the viewer, so interactive operation is possible. By inserting a time filter between the GPS data source and the formalize filter, as shown in Figure 6-(2), we can control the term of the location track data.

To visualize the detail of location track data, we use a map as a background. Figure 2-(3) is the result of connecting a map data source directly to a normal map viewer. By also feeding the output of the map data source to the viewer of track data, as in Figure 6-(3), we can see the track data together with the map data, as in Figure 2-(4).

With a matching filter, we can join the data of two inputs by comparing their geographic coordinates. Using this filter, we can obtain the name of a stop point from a map. Figure 6-(4) is a program that uses a matching filter to obtain the names of stop points from map data and the output of a formalize filter. In this program, we visualize the output as a list by using the table viewer shown in Figure 3.

DEVELOPMENT OF A DATA SOURCE, A FILTER AND A VIEWER
If the ready-made data sources, filters or viewers do not satisfy users, they can also develop original ones in Java. We provide a skeleton for them; users can create a new data source, filter or viewer by extending four functions in the skeleton.

Inside the mPATH framework, data sources, filters and viewers are designed as the same object, named ActionFilter. An ActionFilter can have any number of input and output connectors. An ActionFilter with no input connector works as a data source. An ActionFilter with no output connector works as a viewer. An ActionFilter with both input and output connectors works as a filter.

Since the data transfer mechanism in the mPATH framework is designed in a demand-driven style, filtering logic is implemented in a function that returns the result of the filter. A messaging mechanism that notifies changes of an upper ActionFilter is provided to realize data-driven analysis. When the lowest ActionFilter receives a change event, it demands newer results from the upper ActionFilters, and data-driven analysis is thereby accomplished.

The following functions are provided for developers.

1. getActionElement(GeoShape area): This function is called when an analysis result is required by lower ActionFilters. Area information is given as an argument, and the function must return all analyzed results in the form of ActionElement. When developing a filter, a developer implements the logic of the filter in this function. When developing a data source, this function is used to acquire action history and convert it into the unified type of data.
2. afterConnectFrom(ActionFilter fromFilter): This function is called when another ActionFilter is connected to an upper socket. The connected ActionFilter is passed as an argument. When developing a viewer, the output of the viewer needs to be changed when this function is called.
3. afterDisconnectFrom(ActionFilter fromFilter): This function is called when a connected upper ActionFilter is disconnected. The disconnected ActionFilter is passed as an argument. When developing a viewer, the output of the viewer needs to be erased when this function is called.
4. preNotification(ActionFilter fromFilter): This function is called when the state of the upper ActionFilter given as an argument has changed. After this function has been processed, the change event is transmitted to lower ActionFilters. When developing a viewer, the output of the viewer needs to be refreshed in this function.

APPLICATIONS
In addition to the normal map viewer application mentioned as an example, we developed two applications on the mPATH framework. The first is a system that lists points of interest by analyzing stays in certain places. The second is a visualization system for travel activity that focuses on the traveler's interest in places.

Listing Points of Interest System
We developed a system on the mPATH framework that extracts a user's interest points. An interest point is a place where the user did something or stayed for a long time. We also detect the name of the point using the matching filter. In this system, we visualize interest points in a table. Figure 3 shows the result of the visualization.

Visualization with Weight
We developed a weighting method to classify regions by the user's activity. In our algorithm, we first divide the region into a grid of small cells. In each cell, we count the number of times we visited, the number of times we took pictures, and the number of times we shopped. Then we add these values to obtain the weight of each cell as a number. We implemented this algorithm on the mPATH framework and developed a visualization system that reflects the weights of the region. Figure 4 shows an example. In the example, we weighted each cell mainly by the number of pictures the actor took, and the weight of each cell is visualized as a scale. Because a photo is shown when the user clicks on the map, this system can also be used as a photo viewer.

EVALUATION
In this section we evaluate the performance of the mPATH framework. The evaluation shows whether the response of the system is sufficient for interactive development of visualization methods. It can also be used to find functions that should be improved. To measure the performance, we used the environment shown in Table 2 and the data shown in Table 3.

Table 2: Evaluation Environment

CPU      Pentium4 2.53GHz
memory   1024MB
OS       Linux 2.4.22
JDK      J2SDK 1.4.2_03

Table 3: Data

Type                    description             size
GPS location tracking   Jun. 2003 – Nov. 2003   1,956,381 byte
Map Data                Fujisawa city           9,261,000 byte

Overhead of component architecture
We measured the performance of the component architecture, because the architecture could add overhead compared to a hard-coded implementation. We measured a simple visualization method with several time filters, changing the number of time filters.

Figure 7: Measurement of the Overhead

Figure 7 shows the result of the measurement. As we increased the number of time filters from zero to 15, the growth in time was small. This result shows that the overhead of connecting analysis components is small.

Evaluation of the Performance of the Visualization on the System
We measured the performance of a visualization system constructed on the mPATH framework. We used a simple visualization application in which location track data and map data are rendered in the same window, and measured the time required for rendering.

Figure 8: Application Performance

Figure 8 shows the result. It also shows the results of rendering the location track data alone and the map alone. The lowest rendering time is about 800 ms, and for small-scale maps rendering takes more than 1500 ms. In the overlaid case, the map takes twice as long to render as when the map is rendered alone, and this can be reduced.

RELATED WORKS
We introduce several research projects and products as work related to the mPATH framework. These systems do not treat action history, but they provide flexibility of data analysis and representation through a visual language.

DFQL
To improve the query languages of databases, various graphical query languages have been proposed in place of text-based languages like SQL. DFQL [4] is one of the graphical query languages for use with scientific databases. DFQL uses a data flow language and enables analysis and visualization in addition to the data retrieval that general graphical query languages focus on.

Max/MSP
Max/MSP [2] is a visual programming system for MIDI and audio signals. The system enables interactive processing of music streams with a graphical data flow language and is used to create electronic musical instruments or effectors with original sound algorithms, as well as interactive media systems. The visual language is used by many creators of electronic media, and the fruits of their programming are exchanged widely on the Internet. Since modules of the visual language can be developed in C, many original modules are also distributed.

CONCLUSION
In this paper, we presented a programmable analysis and visualization framework for action histories, called the mPATH framework. The mPATH framework provides a data flow visual language and enables flexible and interactive analysis by connecting analysis components through mouse operations. Through its component architecture, the framework makes it possible to provide various visualizations that represent various aspects of action history. We implemented the mPATH framework in Java and demonstrated the construction of various viewer applications. We also evaluated the performance of the framework and confirmed its interactivity.

We are planning to extend the mPATH framework, focusing especially on the following issues.

Implementation of a Sharing Mechanism
We are now implementing a mechanism to share programs on the mPATH framework. We designed an XML-based description as a file format for programs on the mPATH framework. We will implement functions to import and export the XML. We are also planning to provide a server on the Internet for uploading and downloading the XML files.

Enable Visual Programming of Parameters of Modules
We will extend the visual language and enable visualization of the parameters of each module. We will enable the parameters of modules to be treated as modules in the system. This feature will realize coordination between modules and enable more complex analysis and visualization.

REFERENCES
1. D. Ashbrook and T. Starner. Learning Significant Locations and Predicting User Movement with GPS. In Sixth International Symposium on Wearable Computers (ISWC 2002), pages 101–108, October 2002.
2. Cycling'74. Max/MSP. http://www.cycling74.com/products/maxmsp.html.
3. Dave Wissenbach. Wissenbach Map3D. http://myweb.cableone.net/cdwissenbach/map.html.
4. S. Dogru, V. Rajan, K. Rieck, J. R. Slagle, B. S. Tjan, and Y. Wang. A Graphical Data Flow Language for Retrieval, Analysis, and Visualization of a Scientific Database. Journal of Visual Languages & Computing, 7(3):247–265, 1996.
5. Garmin Ltd. Garmin eTrex Legend, 2001. http://www.garmin.com/products/etrexLegend/.
6. Hiroya Tanaka, Masatoshi Arikawa, and Ryosuke Shibazaki. Extensive Pseudo 3-D Spaces with Superposed Photographs. In Proceedings of Internet Imaging III and SPIE Electronic Imaging, pages 19–25, January 2002.
7. JEIDA. Digital Still Camera Image File Format Standard (Exchangeable image file format for Digital Still Cameras: Exif) Version 2.1. 1998.
8. W. M. Newman, M. A. Eldridge, and M. G. Lamming. PEPYS: Generating Autobiographies by Automatic Tracking. In Proceedings of ECSCW '91, pages 175–188, September 1991.
9. D. J. Patterson, L. Liao, D. Fox, and H. Kautz. Inferring High-Level Behavior from Low-Level Sensors. In Proceedings of The Fifth International Conference on Ubiquitous Computing (UBICOMP2003), pages 73–89, 2003.
10. Tomohiko Sugimoto. KASHMIR 3D. http://www.kashmir3d.com.
11. K. Toyama, R. Logan, and A. Roseway. Geographic Location Tags on Digital Images.
In Proceedings of the eleventh ACM international conference on Multimedia, pages 156–166. ACM Press, 2003.
12. J. Wolf, R. Guensler, and W. Bachman. Elimination of the travel diary: An experiment to derive trip purpose from GPS travel data. Notes from Transportation Research Board, 80th annual meeting, January 2001.

Providing Privacy While Being Connected

Natalia Romero, Panos Markopoulos
Eindhoven University of Technology
Den Dolech 2, 5600MB Eindhoven, The Netherlands
n.a.romero@tue.nl

INTRODUCTION
Privacy is typically studied in terms of conflicts over information and data security or as a human rights issue. However, a broader view of privacy focuses on how people choose to share personal information as well as to keep it to themselves. This research examines this latter view, studying what people want to share, when, how and with whom in the context of Awareness Systems.

SUPPORTING INFORMAL SOCIAL COMMUNICATION
We describe research into supporting the leisure (non-work-related) use of communication media, and more specifically of Awareness Systems. Awareness Systems are meant to provide a low-effort, background communication channel that occupies the periphery of the user's attention and helps this person stay aware of the activity of another person or group. Awareness Systems do not aim directly to support information exchange tasks, as for example e-mail and telephone calls do. Rather, the awareness they aim to create is similar to the awareness of people in surrounding offices at work or of one's neighbours at home. Such awareness is built by tacitly synthesizing cues of people's presence and activities, e.g., footsteps in the corridor and discussions in the street outside. In many cases, these cues have very low accuracy, e.g., we can notice that there are people talking but not what they say, but this low accuracy is sufficient for providing this awareness [4, 8].
Awareness Systems have been studied in the work environment, starting from the Media Spaces work [3] at Xerox. Research into leisure and especially domestic use of Awareness Systems is more recent. In our research, we study and envision the use of such systems to support communication between people with an existing and close social relationship. One solution for providing an awareness system that helps family members stay in touch through the day is the ASTRA system described in [6]. Here we are concerned mostly with the privacy issues arising in the context of Awareness Systems, and more specifically in the context of ASTRA.

The ASTRA System
The operation of the ASTRA prototype described in [6, 11] is shown in Figure 1. The system helps family members who do not live in the same household to communicate asynchronously. An individual takes a picture of a situation she would like to share right away with a specific person or with everyone in another household. She composes a message with the picture and a personal note and sends it. A person at home can, at any moment, check the messages sent. The technology used consists of a mobile device (a mobile phone with a camera and GPRS functionality) that captures and sends pictures and notes to a home device (a portable display with touch screen capabilities) that continuously shows the collection of messages that have been sent by members of the other household. For more details of the implementation, please refer to [6, 11].

The homebound device uses a spiral visualization to place the messages in a timeline structure in which the user at home can navigate between previous and more recent messages. The display offers a shared space where all members of the family can see the messages that have been sent to the family. It also offers a personal space where each member can view the messages that have been sent only to her/him.
Figure 1: Connecting mobile user with the household through the ASTRA prototype

Pictures plus notes may be used to trigger communication or as conversation props during other communication activities. A field test [6] was executed as part of the ASTRA project, which confirmed (also with quantitative evidence) that the system indeed helps related distributed households to stay in touch and get more involved in each other’s lives. From the sender side, results show that, through taking pictures and writing handwritten notes, the system supports mobile individuals in sharing moments through the day that they might not feel are sufficiently noteworthy for sharing by more intrusive means of communication. From the receiver side, participants indicated that regularly receiving messages gives them a lasting sense of awareness about the members of the other household and therefore makes them feel much closer to each other’s lives.

ADDING PERVASIVE FEATURES TO AWARENESS SYSTEMS

When adding automation, the question is how to balance convenience and control when interacting with Awareness Systems. On the one hand, automation may relieve the user of undesired tasks and therefore help her to focus on the meaningful tasks. On the other hand, it may easily become a surveillance system, where the user is unable to control what information about him has been captured and delivered to others. The ASTRA system offers a simple and explicit way of peripheral awareness between family members, based on explicit picture-based communication and on manually inputting to the system one’s availability status, e.g., by email, telephone, instant messaging, etc.

While the field study has shown that ASTRA provides measurable affective benefits to its users, from a research point of view it is interesting to study to what extent adding more flows of communication and some degree of automation to the system will add more benefits without incurring too many costs. Relevant costs that may be experienced are the loss of autonomy because someone feels watched, the feeling of being obliged to return calls, the disappointment of not receiving an expected answer to a message, etc. Other costs relate to the effort of continually updating one’s status, taking pictures, or being disrupted by the arrival of new messages. To a large extent, improvements to ASTRA can help alleviate some of the costs mentioned above. Possible extensions include:

• Automatic notification to the sender when a picture has been viewed by the receiver(s).
• Peripheral awareness of the history of use of the home device by a user.
• Automatic presence capturing when a user is using/looking at the home device.
• Machine perception (e.g. sensors) that supports users in managing their reachability information, which could help family members to control the disclosure of and access to information while avoiding excessive interaction workload.

However, besides these benefits, automation also incurs costs. For example, there is a tension between providing the desired level of control over what is captured, shared and displayed, and the level of interaction convenience a user wants. Automatic capture can help users to effortlessly maintain a sense of each other’s context and activities. For example, an Awareness System can provide cues about the type of situation in which users find themselves (e.g. private or public context). This information can help users to adapt their behaviours between different situations [2]. For example, in an elderly care situation, a daughter who is concerned about how her elderly father is doing can refrain from phone calls that resemble a ‘doctor interview’ conversation, because she will be able to answer those questions directly from an Awareness System, and can therefore concentrate on more pleasant and meaningful talks. Automatic capture offers a good variety of techniques to support these goals. Sensors and activity logs are some of the techniques we want to explore.

PRIVACY IN AWARENESS SYSTEMS

Awareness Systems technologies provide access to an increasing amount of information that is captured by sensing context in the physical environment and social situation. This capability is potentially valuable for the consumer, who has access to increasing volumes of content through numerous media, places and times of the day. Critical for ensuring user acceptance is finding a balance between the amount of personal information captured, how it is captured, the way it will be used, etc., and the protection of the user’s privacy. Typically, questions of privacy are interpreted by people to refer to: (a) undue continuous surveillance by a third party (what is known as ‘Big Brother’); (b) unauthorised access to private information. These views are rather restricted, as we can see from a simple consideration of privacy issues in the ASTRA system. In the ASTRA system every user of the home device has their own “area”. This is where they can see postcards sent only to them instead of to the household. In the design phase, we decided not to use any authorization process for accessing this area (e.g. login/password). Within a family we can rely on social norms. Family members will normally respect each other’s privacy and refrain from opening doors that they should not [1] or peeping into drawers or personal objects (e.g. a teenager’s diary). Protective and security mechanisms seemed an unnecessary interaction cost, inconsistent with the idea of having a low cost/effort communication medium. During the field tests, mobile individuals did not send any messages directly to individual persons.
They indicated that nothing was too personal in nature, as they appreciated communicating at once with the whole household and not just one member. In conclusion, privacy issues emerge even without introducing surveillance; however, a feeling of being under surveillance can arise when a constantly-on communication channel is open and when a feeling of obligation to interact is felt. Also, privacy and information security are not synonymous. Privacy management can be achieved by social interactions and social rules within a group of people, and may concern the will to share information just as much as the will to protect it. Awareness Systems can lead to us knowing more than we want about friends and family, breaching their privacy or creating embarrassment. For example, the parents may unintentionally obtain an overview of their teenage daughters’ social network, or grandma may find out that the grandchild who said she couldn’t visit because of a school trip is at home listening to music. Besides providing an inappropriate amount of information, another privacy aspect of Awareness Systems concerns the failure to establish appropriate interaction/communication patterns. For example, an always-on communication channel between a mother and her son who lives far away tells her a lot about his daily routine and activities when he is at home. This can give her a sense of connectedness but may also give rise to an undesired level of engagement whenever the son is at home, even if he just wants to stay there without having to interact with anyone. Looking at these two kinds of privacy failures in Awareness Systems, it seems crucial to enable users to regulate the process of privacy management. Our aim is to design a “Privacy Profile Interface” (PPI) for Awareness Systems that helps users determine their own balance between their needs for communication and privacy.
PRIVACY IN SOCIAL PSYCHOLOGY

Early works such as the theories of Westin [14] and Altman [1] study privacy from a social perspective, i.e. pertaining to unmediated human-human social interactions. Both conceptualise privacy as a dynamic process between the desire to be alone and the desire to interact with others. Westin’s theory of privacy states and functions [14] has been an influential discussion of the ways people might want to achieve privacy, focusing on different ways and reasons for individuals to be alone or to be left alone. He identifies four different types of privacy (solitude, anonymity, intimacy and reserve) used as mechanisms to achieve four purposes or ends of privacy (personal autonomy, emotional release, self-evaluation, and limited and protected communication). Without getting into the details of his theory, we can clearly see that privacy may refer to groups as well as individuals, and can be effected through physical separation or through people’s behavioural mechanisms. Altman takes a broader view than Westin, considering privacy as a dialectic process by which people manage the extent to which they are accessible to the environment. He defines behavioural mechanisms (verbal and non-verbal behaviour, personal space and territory, and culturally defined norms and practices) for privacy regulation. He includes in his theory both social and environmental psychological concepts, and describes how people use the environment to manage privacy (e.g. territory, personal space) and how these mechanisms affect the regulation of social interaction, looking at both input (e.g. regulating who visits, being observed) and output (e.g. disclosing to another) aspects of privacy. Although these theories do not cover the full complexity that privacy brings about when studying its impact in awareness systems, they give us a good framework to conceptualize privacy in the context of human social behaviours and human social needs.
Mediated Social Communication

Social communication can be characterized as an interaction need of users to exchange information, and an outeraction [9] need that comprises several conversational processes, outside the exchange of information, for reaching out to others for communication. Following the same idea, privacy concerns can be divided into two perspectives: an information control perspective and an interactional control perspective. An information perspective addresses privacy of the information content communicated: what information do users want to exchange? How? When? With whom? An interactional perspective addresses privacy of the outeraction needs for communication: what behavioural mechanisms do users need? In which context? How can they be supported? Rather than controlling access to Personal Information (PI), an interaction control perspective encourages users to develop their own social mechanisms to address the problem of interruption by undesired communication.

PRIVACY MANAGEMENT

Recent studies [7, 12, 15], mainly focused on the mobile communication domain, have developed several techniques to address these conflicts. We view the state of the art of privacy in Awareness Systems in terms of the distinction between the information and interaction control perspectives.

Information Control Perspective

Most existing work tries to facilitate communication by helping users to control their own PI and to access others’ PI. Two examples of such systems are Personal-level Routing and Presence Cues, described below. Personal-level Routing [12] is a personal proxy that maintains person-to-person reachability. It is a rule-based engine: users set their own rules, and in return it offers them a routing service that tracks location, converts a message’s format and forwards it to the proper communication medium. It protects privacy by hiding location information and by filtering and routing incoming messages according to the user’s desires.
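To make the idea of rule-based filtering and routing more concrete, the following is a minimal Python sketch. It is not the design of the system in [12]; the names `Message`, `Rule` and `route_message`, as well as the media labels, are hypothetical and chosen only for illustration. The sketch assumes that rules are checked in order, that the first matching rule decides the outcome, and that a rule may drop a message instead of forwarding it.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Message:
    sender: str
    urgent: bool
    body: str


@dataclass
class Rule:
    # Predicate deciding whether this rule applies to a message.
    matches: Callable[[Message], bool]
    # Medium to forward to, or None to filter (drop) the message.
    medium: Optional[str]


def route_message(msg: Message, rules: List[Rule],
                  default: Optional[str] = "email") -> Optional[str]:
    """Return the medium chosen by the first matching rule, else the default.

    The sender only ever learns whether the message was accepted, never
    where the recipient currently is (hypothetical privacy assumption).
    """
    for rule in rules:
        if rule.matches(msg):
            return rule.medium
    return default


# Hypothetical user-defined rules, most specific first.
rules = [
    Rule(lambda m: m.sender == "telemarketer", None),
    Rule(lambda m: m.sender == "daughter" and m.urgent, "mobile"),
    Rule(lambda m: m.sender == "daughter", "home_display"),
]

print(route_message(Message("daughter", True, "call me"), rules))
```

Even this toy version hints at the interaction cost of such systems: every new contact or situation requires the user to write and order another rule by hand.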
A clear constraint of this solution is that users need to interact with complex interfaces to explicitly set their own rules. The Presence Cues project [7] offers presence cues for telephone users that display dynamic information about the number at which the recipient can be reached and how available s/he is for the next call, in what the authors call a “live address book”. In this case presence information is based on availability, current reachable number and personalized status messages. It prompts users to explicitly update their own presence information when it automatically detects a potential updating situation, and also offers access from multiple devices to actually perform the update. In this way it tries to address the trade-off between overhead and control of information. Although this solution provides a good balance between automatic and manual updating, it underestimates the highly dynamic nature of availability information, which needs to be constantly updated. As a consequence, it was not valued as a reliable and useful social cue in their tests.

Interaction Control Perspective

The interaction control perspective in mediated social communication activities faces two major privacy conflicts:

1. Interactional commitment or attentional contract [15] refers to the level of engagement both recipient and initiator are willing to convey in their current communication activity. For example, it could be phrased in terms of the desired effort to put in (‘a short chat’, ‘a long talk’, ‘just a note’) or in terms of which medium is chosen (‘only text’, ‘only voice’, ‘only image’, ‘video’, etc.). A typical conflict scenario is how to negotiate the initiator’s intention for communication with the recipient’s desired level of commitment.

2. The natural asymmetry between initiator and recipient refers to the unbalanced power that the initiator has over the recipient, mainly when starting a communication activity.
Push-to-talk [15] represents the idea of protecting privacy through interactive negotiation. Based on cellular radio technology, it offers a direct and accessible communication channel between small groups of people. It covers several styles of conversation, such as bursty, intermittent and focused. Instead of relying on automatic management of users’ reachability, it relies upon lightweight social interaction mechanisms to avoid undesired levels of engagement when communicating. For example, plausible deniability of presence by the recipient helps to negotiate, at low social cost, the intention of the initiator against the commitment desired by the recipient at that time. Delaying or omitting responses provides a more relaxed protocol in which expectations or obligations are not strong enough to overrule the personal desire, at that time, to interact or not with another person. Decreased costs for openings and closings make it easier for both the initiator and the recipient to propose and/or reject the initiation of a conversation without feeling too much responsibility for that action. While it seems to be a very effective solution to protect privacy, its success is mostly based on supporting only small groups of people in which a (high) level of social knowledge already exists. The design question here is: to what extent can awareness systems afford sufficiently numerous and flexible mechanisms of this kind to help users control their social interactions?

FUTURE WORK

The everyday perception of the term privacy is associated with threats, violations, misuse, etc. of personal information. Answering the question of what individuals do NOT want to share clearly leads us to an unlimited list of issues. Our approach proposes to observe users’ attitudes and behaviours when using awareness systems.
This can help identify privacy requirements for such systems, answering the question of what information about themselves individuals DO want to share, with whom, in what contexts/times and for what purposes. In this sense, awareness systems provide a sociable way to study privacy requirements. We examine the sharing of information and the negotiation of information communication channels when a social purpose is pursued and when the social image of a person (how their “self” is presented) is concerned. Two major design tradeoffs play a crucial role:

• Informativeness vs. privacy has to do with how much personal information a user needs and wants to convey without violating his/her own privacy.
• Overhead vs. control has to do with how a user wants to maintain his/her own personal information.

We aim to investigate to what extent information management becomes an excessive workload for the user, and whether people can and are willing to take control over privacy management in awareness systems.

The two perspectives to study privacy

As introduced and explained in previous sections, we propose two different perspectives to study privacy in awareness systems: the information and interactional control perspectives. Based on the literature findings previously described, and taking advantage of an existing awareness system, the following proposal describes how the ASTRA system could be extended to address privacy from these two different angles. The ASTRA system will be extended with a PPI (Privacy Profile Interface) to allow for management of a person’s privacy using mechanisms that correspond to both these perspectives. The extended system will be tested in order to validate and generalize concepts of privacy regulation to help users with the dynamic process of privacy management in awareness systems.
Information Control of PPI

The main objective is to address privacy concerns based on disclosing, control, and access of information depending on the type of information exchanged:

• Information awareness that facilitates communication can be provided by means of personal information (e.g. availability, location, reachability), context cues (e.g. office hours, traffic jam, sport night, holidays, etc.) and social cues (e.g. dinner time, social evening, family meeting, etc.).
• Information content that is exchanged during communication, where factors like sensitivity, relevance, temporality, etc. influence how to deal with privacy.

Interactional Control of PPI

The main objective is to address privacy concerns based on the choice and use by the user of awareness mediated mechanisms:

• From the initiator point of view, a major privacy need relates to control over connection failure. This can be supported by “preambles” where the initiator can be informed of the readiness of the recipient for communication before attempting to make contact. The chance of easily switching media can be another solution, helping the initiator to choose the proper media for a successful connection.
• From the recipient point of view, a major privacy need relates to control over the timing of a communication. For this purpose several mechanisms can be used: screening of messages, so that messages can be easily masked without interrupting other activities of the recipient; plausible deniability of presence, by which the recipient decides whether to show the initiator that she is there or not; and delaying/omitting responses, by which the recipient can decide whether to react or not without incurring a high cost for not answering a message.
• From both the recipient and the initiator point of view, the possibility to collectively control interactional commitment and the desired level of engagement provides other mechanisms for regulation of privacy. Interesting examples are: (1) lightweight openings and closings with no need of fixed protocols (how are you, I need to hang up now, etc.) that make it easier for the initiator to propose a contact and easier for the recipient to engage in or reject it; (2) lightweight swapping of activities; (3) reduced feedback/accountability, where less awareness may lead to fewer expectations and obligations.

CONCLUSION

This project aims to define a set of policies that will ensure a proper balance between the communication benefits and privacy costs that are experienced by users of awareness systems. These policies should support different levels of automation when sharing information, based on the content shared, the circumstances and the audience involved. This might guide us in the creation of a proper design interaction framework to offer a build-up privacy model when designing awareness systems.

Expectations from the Workshop

The focus of attention of this research can be described in the following list of research questions:

Information perspective

• What information do people want to be captured implicitly by automatic capturing techniques and explicitly by input devices? How to represent information about one’s actions with respect to a specific receiver?
• What information is temporally sensitive (becomes history) when log applications are provided? How to provide the proper interpretation of past, present and future actions?

Interaction perspective

• What are the desired levels of feedback (accountability) users want when sensor capturing occurs? How to provide understanding and anticipation of how one’s actions appear to others?
• What are the desired levels of control for the receiver over the information displayed? Decision of what to view, when and how to view it.

REFERENCES

[1] Altman, I. The environment and social behaviour. Brooks/Cole, Monterey, CA, 1975.
[2] Adams, A. and Sasse, M.A., Privacy Issues in Ubiquitous Multimedia Environments: Wake Sleeping Dogs, or Let Them Lie? Proceedings of Interact '99, International Conference on Human-Computer Interaction, Edinburgh, UK, 1999, IOS Press, IFIP TC.13, 214-221.
[3] Bly, S., Harrison, S.R. and Irwin, S., Media Spaces: Bringing People Together in a Video, Audio and Computing Environment. In Communications of the ACM, (1993), 28-47.
[4] Eggen, B., Hollemans, G. and Sluis, R.v.d. Exploring and enhancing the home experience. Journal of Cognition Technology and Work, 5. 44-54, 2001.
[5] Langheinrich, M., Privacy by Design - Principles of Privacy-Aware Ubiquitous Systems. Proceedings of International Conference on Ubiquitous Computing Ubicomp, Atlanta, Georgia, 2001, Springer.
[6] Markopoulos, P., Romero, N., Baren, J.v., IJsselsteijn, W., de Ruyter, B. and Farshchian, B., Keeping in Touch with the Family: Home and Away with the ASTRA Awareness System. To appear in Proceedings CHI 2004, Extended Abstracts, Vienna, 2004, ACM Press.
[7] Milewski, A.E. and Smith, T.M., Providing presence cues to telephone users. Proceedings of the 2000 ACM conference on Computer supported cooperative work, Philadelphia, Pennsylvania, United States, 2000, ACM Press, 89-96.
[8] Mynatt, E.D., Back, M. and Want, R., Designing Audio Aura. Proceedings of CHI 98, Los Angeles, CA, USA, 1998, 556-573.
[9] Nardi, B.A., Whittaker, S. and Bradner, E., Interaction and Outeraction: Instant Messaging in Action. Proceedings of the 2000 ACM conference on Computer supported cooperative work, Philadelphia, Pennsylvania, United States, 2000, ACM Press, 79-88.
[10] Palen, L. and Dourish, P., Unpacking privacy for a networked world. Proceedings of CHI’03, Ft. Lauderdale, Florida, USA, 2003, ACM Press.
[11] Romero, N., van Baren, J., Markopoulos, P., de Ruyter, B. and IJsselsteijn, W., Addressing interpersonal communication needs through ubiquitous connectivity: Home and away.
Proceedings of Ambient Intelligence, 2003, Springer-Verlag, 419-431.
[12] Roussopoulos, M., Maniatis, P., Swierk, E., Lai, K., Appenzeller, G. and Baker, M., Person-level Routing in the Mobile People Architecture. Proceedings of 2nd USENIX Symposium on Internet Technologies and Systems, Boulder, Colorado, USA, 1999.
[13] Meyer, S. and Rakotonirainy, A., A survey of research on context-aware homes. Proceedings of Australasian information security workshop conference on ACSW frontiers, (Adelaide, Australia, 2003), Australian Computer Society, Inc, 159-168.
[14] Westin, A.F. Privacy and Freedom. Atheneum, New York NY, 1967.
[15] Woodruff, A. and Aoki, P.M., How push-to-talk makes talk less pushy. Proceedings of the 2003 international ACM SIGGROUP conference on Supporting group work, Sanibel Island, Florida, USA, 2003, ACM Press, 170-179.

Capturing Conversational Participation in a Ubiquitous Sensor Environment

Yasuhiro Katagiri ATR Media Information Science Labs. 2-2-2 Hikaridai Keihanna Science City Kyoto Japan +81 774 95 1480 katagiri@atr.jp
Mayumi Bono ATR Media Information Science Labs. 2-2-2 Hikaridai Keihanna Science City Kyoto Japan +81 774 95 1466 bono@atr.jp
Noriko Suzuki ATR Media Information Science Labs. 2-2-2 Hikaridai Keihanna Science City Kyoto Japan +81 774 95 1422 noriko@atr.jp

ABSTRACT

We propose the application of ubiquitous sensor technology to capturing and analyzing the dynamics of multi-party human-to-human conversational interactions. Based on a model of conversational participation structure, we present an analysis of conversational interactions in the open interaction space of a poster presentation session. We argue that the patterns of transition of the conversational roles each participant plays in conversational interactions can be captured through analysis of the participants' exchange of verbal and non-verbal information in conversations. Furthermore, we suggest that the dynamics of conversation participation structures captured in the ubiquitous sensor environment provide us with a new method for summarizing and displaying human memories and experiences.

Keywords

Conversation participation, ubiquitous sensor environment, human behavior analysis, non-verbal information

INTRODUCTION

A natural human conversation requires more than a mere exchange of utterances between conversational participants. Conversational participants first need to be established and mutually admitted into the conversation before they can engage in it. Participants play certain roles in the conversation, and these roles change during the course of the conversation. New participants may join and old participants may leave. These dynamic changes are signaled and managed through the use of various verbal and nonverbal cues. Studies on the structure of conversations and social interactions have been conducted in the fields of sociology and anthropology. However, little work has been done in the CHI community, despite the increasing recognition of the importance of non-verbal information and its functions in human-to-human interactions. Several attempts have recently been made to automatically extract conversational events from speech in two-person dialogues (Basu 2002, Choudhury 2004). We argue that ubiquitous sensor technology provides a useful set of tools that facilitate the empirical examination of human conversational processes from real conversation data, through systematic collection and analysis of non-verbal as well as verbal information exchanged in multi-party conversations. Furthermore, it offers a novel opportunity to share our memories and experiences by exploiting information on fine-grained verbal and non-verbal exchanges in conversations. This data can be utilized in both summary creation and presentation of captured experience data. Dynamic information on the structure of conversation participation can be used to organize and summarize one's personal history of social interaction (Hagita et al., 2003).
Elucidating the dynamic transitions of conversation participation structures is also essential to developing robotic/electronic agents that can interact with and help humans in daily life situations. We first present our attempt to capture the dynamics of human conversation participation structures through the analysis of speech turn structures and eye gaze distributions produced by the participants in conversations obtained in our Ubiquitous Sensor Room environment. We then present a simple system that provides conversational participants with real-time information feedback on the status of ongoing conversations. Finally, we show, based on our experiment on collecting poster presentation conversation data, that interaction metrics, associated with conversational participation structures, obtained from ubiquitous sensors provide a good measure of people's interest toward objects and events in their interactive experiences. PARTICIPATION STRUCTURE IN CONVERSATION Participation Structure To engage in a conversational interaction, people first need to establish a conversational space together with their conversational partners, or otherwise enter an existing one, before they can actually talk to each other. Conversational space formation normally proceeds by conversational partners first approaching each other to form a spatial aggregate and then exchanging eye gaze and various forms of greetings. Goffman (1981) analyzed the phases of conversational interaction and defined the internal structure of conversational space as 'participation structure' or 'participation framework.' In conversation, the participants exchange their roles, such as 'Speaker' and 'Addressee,' by exchanging the right of utterance for a moment. The structure of conversation (participation structure) consists of components, e.g., participants, and their interrelationships. Both can change dynamically through the course of conversational progressions. 
Clark (1996) proposed a model of participant relationships as shown in Figure 1.

Figure 1. Participation structure.

Clark defined Speaker as the agent of the illocutionary act and Addressee as the participant who is the partner of the joint action that the Speaker is projecting for them to perform. Side Participants take part in the conversation but are not currently being addressed. All other listeners are Overhearers, who have no rights or responsibilities in the conversation; that is, they don't take part in it. There are two main types of Overhearers: Bystanders are openly present but not part of the conversation, while Eavesdroppers listen in without the speaker's awareness.

Audience Design and Interactivity

In a conversation with more than two participants, one person speaks to the others at a time. While a group of listeners is collectively called an 'audience,' they don't have equal rights and responsibilities in listening to the Speaker's utterance. The Speaker can exert control over the progress of conversation by selecting, from among the members of her audience, who is to be Addressee, and she then directs her speech to him. Clark & Carlson (1982) introduced the notion of audience design to capture this phenomenon. The Speaker designs an utterance for a specified listener who is assigned the role of Addressee by making her utterance easily accessible to him through common background between them, for example, by including topics that only the listener knows. Audience design involves both verbal and non-verbal information exchange, and it is expected to create specific interactions between Speaker and Addressee, different from those between Speaker and other non-Addressee members of the audience. Although studies have been made on these issues, such as knowledge and information in the participant's mind, by observing the contents and forms of utterances to derive a pragmatic interpretation, there has been little work on non-verbal behaviors. We attempt to elucidate in this study, with the help of ubiquitous sensor technologies, how non-verbal cues, such as body postures and eye gaze distributions, are utilized in the process of audience design, and what effects they have on conversational interactions.

CAPTURING ENVIRONMENT

In order to empirically examine conversational participation processes from real conversation data and to investigate the possibilities of incorporating conversational participation information in experience sharing technologies, we have set up a Ubiquitous Sensor Room environment and have been collecting a corpus of conversational interaction data in poster presentation settings (Hagita et al., 2003). Figure 2 shows a schematic layout of the Ubiquitous Sensor Room environment. It has several presentation booths, each with its own set of posters and demonstrations, where the exhibitors give poster presentations. The room has a number of cameras and sensors for recording the behaviors of both exhibitors and visitors.

Figure 2. Ubiquitous sensor environment.

Two cameras are fixed to the ceiling of each booth to capture human behaviors: placement, inter-personal distance and posture. Furthermore, each participant is equipped with a headset microphone with sensors and a camera that captures speech and approximate gaze direction. We observed the relationships of speech and gaze directions in the conversation by using the recorded data, which indicate patterns of interaction in the participants' verbal and non-verbal behaviors. An interaction corpus in this environment has been collected during the ATR Exhibitions of 2002 and 2003, when a wide variety of people from outside ATR visited poster and oral presentations and joined demonstrations. We have also conducted experimental corpus collection with a limited number of participants.

DYNAMICS OF PARTICIPATION

Here, we present our initial results of analyzing the Interaction Corpus (Hagita et al., 2003, Bono et al., 2003a) by using concepts of participation structure. We focus on situations in which a third participant (second visitor) joins the already established conversation between two participants (poster exhibitor and first visitor).

Two Phases of Participation

Clark's model of participation structure indicates a natural organization of two phases of participation: participation in the conversational space and participation in the conversation itself. In a poster presentation conversation, visitors first approach the poster to hear the exhibitor's speech and to look at the poster contents in detail. This reduction of physical distance amounts to the initial participation in the conversational space, i.e., being promoted from a non-participant to a bystander participant. Beyond participation in the conversational space, the participant needs to be further promoted to enter the conversation itself. The participant either takes the floor of the conversation himself/herself (i.e., being promoted to Speaker), is assigned the role of Addressee by the current Speaker (i.e., being promoted to Addressee), or is admitted to join by receiving the Speaker's gaze, namely, the recognition of his/her existence by the existing participants (i.e., being promoted to Side Participant). These are some of the possibilities explaining how participation progresses in conversation.

Figure 3. Participation in conversational space.

Participation in the conversational space does not necessarily lead to participation in the conversation. Figure 3 shows a conversational scene taken by a fixed camera during a poster presentation. Here, the Exhibitor (E) of the poster and the first visitor (Visitor A: VA) are already engaged in a conversational interaction, which the second visitor (Visitor B: VB) is attempting to join. The scene indicates that VB has approached E and VA, thereby achieving the initial participation in the conversational space. However, the video clip of the scene shows that VB just stayed there silently for about 27 seconds and then left the scene without actually participating in the conversation itself. The activities of the two visitors, VA and VB, were quite different. VA was playing the role of Addressee, while VB was acting as a Bystander. The difference was manifested in the non-verbal behaviors of the participants. E and VA exchanged their eye gazes frequently while they were talking, whereas E did not direct his eye gaze toward VB. E and VA also directed their body postures to each other, away from VB, as if to prevent the newcomer from cutting off their talk. VB, after failing to find a chance to enter the conversation, gave up and left the scene. This small episode clearly indicates that people implicitly rely on non-verbal as well as verbal information exchanges in managing their conversational participation.

Transition of Participation Role Assignment

Figure 4. Dynamics of participation structure transitions.

Figure 4 shows an example of the interplay between nonverbal information exchanges and the switch of conversational roles among participants. The figure shows a sequence of events that took place in a conversational interaction. Each row indicates a new speech turn beginning. The second column from the left indicates the role of each participant in the conversation. The next three columns show the subjects' view data taken by the wearable cameras of three participants. The person standing in front of the poster is the Exhibitor (E), who initially has an absolute right to be a speaker owing to his knowing the contents of the poster and the generally accepted social hierarchy in this situation. The others are visitors who came to listen to the presentation. One of them (Visitor A: VA) came to this booth before the other visitor (Visitor B: VB) (in scene 1). After E and VA exchanged utterances and shared the floor of the conversation for a while, VB arrived (in scene 2).
In scene 2, E is Speaker (SPK). Since E is directing his eye gaze toward VA, as is observed from E's view image data at scene 2, VA is assigned the Addressee (ADR) role, and VB is a Side Participant (SPT) in Clark's categorization. In this sequence of events, the behaviors of the two audience members, VA and VB, are exactly opposite: VA is passive and VB is active. In scene 4, VA was demoted from ADR to SPT and did not have the right to the turn. VB, on the other hand, was promoted from SPT to SPK, so he produced speech and directed it to E. Even though VA stayed at this booth longer than VB, VB was more active than VA. This suggests that the length of stay alone is not sufficient information for understanding participant activities, particularly their interest in their experiences.

Figure 5 summarizes the patterns of transitions in the conversational participation structure. The process of participation in conversation is closely tied to audience design by the Speaker, in the sense that some of the transitions (e.g., Bystander to Side Participant and Side Participant to Addressee) need to be sanctioned by the Speaker.

Figure 5. Dynamic transitions of conversational participation structure.

AUTOMATIC DETECTION AND DISPLAY FEEDBACK OF CONVERSATIONAL MODES

Through examination of many poster presentation sessions, we found that we could distinguish two phases in most of the poster presentations (Bono et al., 2003b). A typical poster presentation starts with the exhibitor's exposition of the poster contents, which is then followed by discussions between the exhibitor and the visitors. In the first, exposition phase, the exhibitor takes the Speaker role for almost its entire duration, while in the second, discussion phase, the exhibitor and the visitors take turns, and the Speaker/Addressee/Side Participant roles switch frequently and rapidly among them. We call the first, one-way exposition phase a lecture mode (L-mode) conversation and the second, two-way interactive discussion phase an interactive mode (I-mode) conversation.

Figure 6 shows a typical pattern of speech pause durations inserted in the exhibitor's talk in both L-mode and I-mode conversations in a presentation session. The figure indicates that exhibitors continue to speak with only short pauses in L-mode, while in I-mode they interleave speaking and pausing, e.g., to take turns with other participants. This difference in speaking style suggests that it is relatively straightforward to distinguish these two conversational modes in terms of turn dominance ratio.

Figure 6. Exhibitor turn dominance ratio in lecture mode and interaction mode conversations.

Figure 7 shows an example of the conversational mode display we implemented in ATR Exhibition 2003. The display indicates in real time, for each poster presentation booth, whether the conversation taking place is in L-mode (one-person sign) or in I-mode (two-person sign). Visitors can choose to visit a lecture session, which is more static and probably easier to join, or to go to a discussion session, which is more active and could be more entertaining.

Figure 7. Conversation mode display.

AUDIENCE INTEREST THROUGH INTERACTION

Figure 8. Audience interest and interaction. (a) Audience interest and time spent in front of posters. (b) Audience interest and verbal response frequencies.

It has often been suggested that the amount of time people spend in front of some material, be it a web page, merchandise, or an exhibit, can be used to measure how much interest they have in it. In contrast to these durational measures, our analysis of conversational participation dynamics suggests another set of interest measures from the perspective of interaction. The more involved people get in interactions, the more interested they are in the people, objects and events in the interactions. Participation structure dynamics could therefore produce a good measure of human interest in interactive experiences such as conversations.
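Both the conversational mode display and the interaction-based interest measures rest on simple statistics over speech turns. As an illustration only, the following Python sketch computes a turn dominance ratio for the exhibitor within a time window and labels the window as L-mode or I-mode. The paper does not give a formula or a cut-off value, so the definition used here (the exhibitor's share of all speech time in the window), the `Turn` record, the function names, and the 0.8 threshold are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    speaker: str  # participant label, e.g. "E", "VA", "VB"
    start: float  # turn start time in seconds
    end: float    # turn end time in seconds


def turn_dominance_ratio(turns, speaker, window_start, window_end):
    """Share of all speech time in the window produced by `speaker`.

    Returns None when nobody speaks in the window.
    """
    total = 0.0
    own = 0.0
    for t in turns:
        # Clip each turn to the analysis window.
        overlap = min(t.end, window_end) - max(t.start, window_start)
        if overlap <= 0:
            continue
        total += overlap
        if t.speaker == speaker:
            own += overlap
    return own / total if total > 0 else None


def conversation_mode(turns, exhibitor, window_start, window_end, threshold=0.8):
    """Label a window "L" (lecture mode) or "I" (interactive mode).

    The threshold is a hypothetical value chosen for illustration.
    """
    ratio = turn_dominance_ratio(turns, exhibitor, window_start, window_end)
    if ratio is None:
        return None
    return "L" if ratio >= threshold else "I"


# Made-up turn log: the exhibitor lectures for the first minute,
# then the three participants exchange turns.
turns = [
    Turn("E", 0.0, 40.0),
    Turn("VA", 40.0, 42.0),
    Turn("E", 42.0, 60.0),
    Turn("VB", 60.0, 70.0),
    Turn("E", 70.0, 75.0),
    Turn("VB", 75.0, 85.0),
    Turn("E", 85.0, 90.0),
    Turn("VA", 90.0, 100.0),
    Turn("E", 100.0, 110.0),
    Turn("VB", 110.0, 120.0),
]

first_minute = conversation_mode(turns, "E", 0.0, 60.0)
second_minute = conversation_mode(turns, "E", 60.0, 120.0)
```

With this made-up log, the exhibitor dominates the first window and shares the floor in the second. A real-time display would presumably need a sliding window and some smoothing to avoid flickering between the two signs; those details are outside this sketch.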
In order to investigate the relationship between conversational participation and audience interest, we conducted an experimental data collection of poster presentation sessions in the Ubiquitous Sensor Room. The experimental setup was exactly the same as that used for our ATR Exhibition corpus collection, but the subjects were recruited and paid to participate in the experiment. Three poster booths were set up with the exhibitors and their posters. A total of 24 subjects participated in the experiment as visitors. After the poster presentation sessions, they were given a questionnaire to gauge their interest in each of the three posters.

Figure 8 shows a comparison between a durational measure and an interaction measure of subject interest. Figure 8(a) indicates, for each of the three posters, the number of subjects who showed the most interest in the poster as well as the number of subjects who spent the longest time in front of it. No clear correlation can be seen between interest and duration. Figure 8(b) indicates, for each of the three posters, the number of subjects who showed the most interest in the poster and the number of subjects who produced the largest number of verbal responses, including both speech turns and backchannels, in the interaction with the exhibitor of the poster. We can see that, in contrast to the ineffective durational measure, the amount of interactive responses is a good measure of subject interest.

When people get involved in conversational interactions, they invariably play the Addressee role, as well as the Speaker role, a number of times during the course of the conversation. Since the Speaker, as part of her audience design, selects an Addressee by directing her gaze toward him, Speaker gaze allocation could be another candidate measure of subject interest.
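The gaze-based measure can be sketched in a few lines of Python. The example assumes that the sensor data has already been reduced to segments of the form (speaker, gaze target, duration), which is a hypothetical representation chosen for this illustration and not the corpus format. Roles follow the scheme described for Figure 4: the person speaking is SPK, the person receiving the Speaker's gaze is ADR, and the remaining participants are SPT. Bystanders who have not been admitted to the conversation are not modeled.

```python
from collections import defaultdict


def assign_roles(speaker, gaze_target, participants):
    """Assign SPK/ADR/SPT roles for one segment from speech and gaze."""
    roles = {}
    for p in participants:
        if p == speaker:
            roles[p] = "SPK"
        elif p == gaze_target:
            roles[p] = "ADR"
        else:
            roles[p] = "SPT"
    return roles


def addressee_share(segments, participants):
    """Percentage of total segment time each participant spends as Addressee.

    `segments` is a list of (speaker, gaze_target, duration_seconds) tuples.
    gaze_target may be None when the Speaker looks at nobody, e.g. at the poster.
    """
    adr_time = defaultdict(float)
    total = 0.0
    for speaker, gaze_target, duration in segments:
        total += duration
        roles = assign_roles(speaker, gaze_target, participants)
        for p, role in roles.items():
            if role == "ADR":
                adr_time[p] += duration
    if total == 0:
        return {p: 0.0 for p in participants}
    return {p: 100.0 * adr_time[p] / total for p in participants}


# Made-up session: E mostly addresses VA, briefly VB; VB then speaks to E.
participants = ["E", "VA", "VB"]
segments = [
    ("E", "VA", 30.0),
    ("E", None, 10.0),
    ("E", "VB", 5.0),
    ("VB", "E", 15.0),
]

scene_roles = assign_roles("E", "VA", participants)
shares = addressee_share(segments, participants)
```

In this made-up session VA receives the Speaker's gaze for half of the recorded time, so under the gaze-allocation measure VA would be the visitor predicted to be most interested. Counting verbal responses per visitor, the other interaction measure discussed above, could be computed from the same kind of segment list.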
Figure 9 shows the relationship between the visitors' interest in the posters and the length of time for which they received the Speaker's gaze, and hence played the Addressee role. The figure indicates that visitors who ranked a poster the most interesting played the Addressee role the most, receiving the Speaker's gaze for the longest duration in the presentation sessions. A full factorial ANOVA with between-subject factors showed a marginally significant difference among the three interest rankings (F = 2.82, p = .09). Multiple comparisons showed a significant difference between interest rankings No. 1 and No. 3 (p < .05).

Figure 9. Effect of audience design on audience interest to poster 1. (Vertical axis: Addressee in Poster 1 (%); horizontal axis: Ranking of Interest.)

These results suggest that the dynamics of conversational participation structures can provide us with a good measure to gauge people's interest in various objects and events. Aspects of these dynamics can be captured with our Ubiquitous Sensor Room environment by examining both verbal and non-verbal signals exchanged in human conversational interactions. These measures could effectively be employed in experience technologies, both in producing summaries of our memories and in providing assistance in presenting and discussing our experiences, through the extraction of interesting episodes that have significant meanings for us.

CONCLUSIONS

We proposed an application of ubiquitous sensor technology for capturing and analyzing the dynamics of multi-party human-to-human conversational interactions. We presented our analysis results of conversational interactions carried out in the open interaction space of a poster presentation session.
We showed that multiple-channel speech and view data collected for each of the conversational participants, together with pictures taken by ceiling-mounted cameras, provide us with a good source of information from which to identify patterns of dynamic transitions of conversational participation structures. We then argued that these dynamics of conversational structures can be utilized to identify objects and events that are significant for people's interactive experiences. The implications of our study, although still preliminary and restricted to a small range of interaction types, are promising for future extensions. We believe the methodology developed in this paper, that is, elucidating the use of verbal and non-verbal cues in human-to-human interactions with the help of ubiquitous sensor environment technologies, has great potential for developing technologies for sharing memories and experiences, particularly when it is combined with automatic signal processing techniques.

ACKNOWLEDGMENTS

This research was supported in part by the Telecommunications Advancement Organization of Japan.

REFERENCES

1. Basu, S. 2002. Conversational Scene Analysis. Ph.D. thesis, Massachusetts Institute of Technology.
2. Bono, M., Suzuki, N. and Katagiri, Y. 2003a. An analysis of participation structure in conversation based on Interaction Corpus of ubiquitous sensor data. In M. Rauterberg et al. (Eds.), INTERACT 03: Proceedings of the Ninth IFIP TC13 International Conference on Human-Computer Interaction, 713-716. IOS Press.
3. Bono, M., Suzuki, N. and Katagiri, Y. 2003b. An analysis of non-verbal cues for turn-taking through observation of speaker behaviors. ICCS/ASCS-2003: Proceedings of the Joint International Conference on Cognitive Science (CD-ROM). Elsevier.
4. Choudhury, T. K. 2004. Sensing and modeling human networks. Ph.D. thesis, Massachusetts Institute of Technology.
5. Clark, H. H. and Carlson, T. B. 1982. Hearers and speech acts. Language, 58: 332-373.
6. Clark, H. H. 1996. Using language. Cambridge University Press.
7. Goffman, E. 1981. Forms of talk. University of Pennsylvania Press.
8. Goodwin, C. 1981. Conversational organization: Interaction between speakers and hearers. New York: Academic Press.
9. Hagita, N., Kogure, K., Mase, K. and Sumi, Y. 2003. Collaborative Capturing of Experiences with Ubiquitous Sensors and Communication Robots. 2003 IEEE International Conference on Robotics and Automation (IEEE ICRA 2003).
10. Kendon, A. 1990. Conducting interaction. Cambridge University Press.