Descrambling the social TV echo chamber
Nitya Narasimhan, Venu Vasudevan
Betaworks, Applied Research
Motorola Mobility, Inc.
Libertyville, IL 60048
nitya@motorola.com, venu.vasudevan@motorola.com

ABSTRACT

Ubiquitous broadband access and increased adoption of tablet and smartphone devices are driving new trends in x-shifting and multi-screen viewing for television. These create audience and attention fragmentation problems that impact viewer engagement. But they also fuel demand for second screen experiences that can enhance first screen content. This dichotomy calls for better understanding and modeling of viewer behavior at a micro-level that can support the design of attention-aware, engagement-boosting experiences. In this paper we report on an exploration of social activity around television from two perspectives: quality and quantity of attention or behavior ‘signal’ in the social noise, and ease of extraction and presentation of signals as real-time ‘hints’ to mobile applications. The process taught us valuable lessons on the cost, constraints and characteristics of different echo chambers for social television. We share these along with early insights into collective attention behaviors observed at different fidelities, some intriguing enough to warrant further investigation. The results show promise for the development of noise cancellers and signal boosters to extract the relevant attention data, but require more work to develop a robust solution for real-time use in second screen companion apps.

one location (home), from a single source (cable), and at a scheduled time (broadcast). Now, x-shifted viewing is on the rise with more users watching content across devices (mobile, PC, TV), locations (home, café, transit), times (recorded, on-demand) and sources (providers, mobile, web portals).

2.
Viewing was traditionally a lean-back experience with users’ attention on a single screen. Today, multi-screen viewing is growing with users watching content on one screen (TV) and interacting with social networks or applications on a second (mobile or tablet), leading to challenges in divided attention.

Measurement mechanisms that rely on self-reporting (user diaries) or automated device reports (set meters) are ill equipped to handle these shifts. Self-reporting incurs non-trivial user effort and recall that is exacerbated by shifted-viewing and can cause omissions or inaccuracies. Device-based reporting may not scale easily to cover all the access interfaces and content repositories involved. The net result is that such systems may under-estimate actual engagement. This is leading to efforts like the Nielsen cross-platform initiative¹ that can monitor broadcast and online viewing in the home. But such solutions are not yet complete or comprehensive; they cannot reliably measure place-shifted viewing, nor do they factor divided attention into their calculations. Further, they give macro insights (e.g., was the show engaging or not) but not micro behaviors (e.g., why was the show engaging, or what segments of the show drove the most engagement). As the definition of ‘commercial success’ gets more atomized (e.g., a sports broadcast is judged not just by its x million viewers but by selling y million jerseys), sub-program interest mining becomes less of a curiosity and more of a mandate for measurement and monetization solutions.

Categories and Subject Descriptors
H.1.2 [User/Machine Systems]: Human Information Processing

Keywords
Activity streams, attention management, social sensor, viewer engagement, social television

1. INTRODUCTION

However, growing multi-screen usage is also driving the market for ‘companion applications’ on a second screen.
Over 40 percent of tablet or smartphone users multi-task while watching television [1], undertaking tasks that often fall into three buckets: search (for information), social (for conversation) and interstitial (for quick, opportune tasks during perceived lulls in the first screen content). By characterizing the type and duration of second screen activity, we can establish a ‘divided attention’ factor to adjust engagement metrics. For instance, related search or social actions are additive while interstitial or unrelated tasks may prove dilutive to viewers’ engagement in that context.

Television is a multi-billion dollar enterprise that generates the bulk of its revenue from advertising, merchandise and content sales. The cost for producing commercial content is prohibitive; analysts believe that HBO’s Game Of Thrones cost over 50 million dollars for just one season. Returns on such investment are measured by viewer engagement; higher ratings translate into premium ad rates and boost merchandise and content sales. As a result, the industry relies heavily on audience measurements like Nielsen’s ratings to evaluate the performance of their offerings. However, changing viewer behaviors pose new challenges. For instance:

1. Viewers traditionally watched content on one device (TV), in

Further, second screen applications promote social conversation (regardless of shifted viewing context), with social activity data often containing that context explicitly (e.g., the user tweets about watching a show on Hulu, hinting at a source-shift) or implicitly (e.g., Twitter annotates location, time-zone or app-name to tweets, hinting at place, time or device shifts).
By characterizing the shift-based features and weighting them suitably, we establish ‘x-shift’ factors to adjust viewer engagement metrics further. Thus, we can weight live viewing higher than deferred viewing (time-shift) or prioritize remote viewing over in-home viewing (place-shift) to acknowledge the additional effort required to find a local station or set up remote streaming, in order to sustain that live viewing.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. MCSS’12, June 25, 2012, Low Wood Bay, Lake District, UK. Copyright 2012 ACM 978-1-4503-1324-7/12/06...$10.00.

¹ “Cross Platform Measurement”, http://bit.ly/Nielsen-cross-platform

viewer attention to it. This distinction may become important in usage of this work, in both provider and application contexts. In the following sections we describe the decision process behind selecting relevant data sources and the workflow behind the analysis. We present preliminary insights, with a focus on the more unusual (or counter-intuitive) results; we found valid explanations for some behaviors but could only develop hypotheses for others, requiring further experimentation for validation. The paper concludes with a brief discussion of related work and an outline of next steps towards design of a system that can meaningfully incorporate the results into a real-time ‘attention sensing’ capability for mobile devices and their resident second-screen applications.

While this complements the macro-level metrics available today, the real utility of social conversations is in the micro-behaviors they expose during the show.
Not only does collective data reveal peaks and valleys of interest during the show, but it also surfaces data related to divided attention. Thus, if interstitial or unrelated companion tasks trigger a social update, its chronological position in that user’s social activity stream can highlight an attention shift to (or away from) surrounding updates related to the first screen. By analyzing streams for statistics on frequency, content types and inter-cluster update times, we can potentially create a divided-attention adjustment for viewer engagement.

2. SELECTING DATA SOURCES

The first step involved selecting relevant sources of social activity data for television viewing. Three types of sources exist: generic social networks (e.g., Twitter²), specialized social networks (e.g., GetGlue³) and social analytics services (e.g., Topsy⁴).

Finally, we note that while social data has value to enhancing our knowledge and characterization of viewer behaviors, it may not be representative of all viewer populations. Thus, older viewers may not embrace social media to the same extent, while younger users may more readily adopt new social paradigms like ‘check-ins’. Or we may find that the same user is more engaged socially in certain contexts (shows, shift state) than others. Thus, a goal for our work is to also understand if (and how) viewer profiles and preferences fit into our model. Ideally, we see two benefits to the work. First, as a ‘complementary’ model to existing engagement systems – where the adjustments we compute are further weighted to reflect some representation bias for the larger viewer audience. Second, and perhaps more interesting to us, as a mechanism that reveals ‘hints’ about user attention or behaviors to applications, enabling them to adapt or create richer companion experiences for content.
In that context, mobile devices and ‘second screen’ applications serve two critical roles: they are producers of social activity traces (given rich shift context and inherent indicator of screen focus), and they are consumers of the attention model (notably using collective attention hints to enhance the individual experience). In particular, the link between mobility and measurement of viewer engagement cannot be overstated; if we buy into the premise that situation is commitment – where the inconvenience and increased user effort in x-shifted viewing speak to a higher degree of user loyalty to the show – then mobile devices provide implicit and fine-grained context that cannot be replicated by in-home systems.

Generic networks offer a breadth of coverage of viewer activities across application domains. On the plus side, this allows capture of non-television activities (e.g., to track divided attention modes) and a broader understanding of user behaviors {before, during and after} each viewing session. On the minus side, this adds noise to content-related signals in the stream, requiring more intelligence to detect and suppress or attenuate the spurious data.

Specialized networks offer a depth of coverage of user activities in a given domain. For television, services like GetGlue and Miso allow users to broadcast what they are watching (‘check-in’), chat with peer viewers (‘comment’) and earn digital rewards (badges, points, bonus content) for participating. On the plus side, it makes for cleaner signals since all activity data is relevant and there is less room for the disambiguation errors plaguing generic streams. On the minus side, the streams can be sparse; users often check-in once per show (to accrue rewards) but mute their conversations after, or move them to a second network.
In this context however, check-in data is invaluable; not only does it correlate directly to a television-viewing event, but analysis shows it correlates well to Nielsen ratings in gauging relative audience engagement [2]. Social analytics services are different in that they provide topical intelligence (collective) rather than user-level activity (raw data). This is useful for gaining broader perspective or overcoming the limitations (or gaps) in data provided directly by the networks. For instance, the Topsy analytics service provides a histogram of Twitter activity for a given topic (query), sampling rate (slice) and requested window (period). With adept usage, it is fairly simple to retrieve historical patterns of behavior for a given show extending back hours, days, weeks, even months, from the present. Twitter’s search API, by comparison, caps query results to ~1500 items. For a high-traffic topic (e.g., Superbowl) this window is a few hours. These observations motivated our initial experiments to determine the viability of social activity streams as a mechanism to detect and characterize television viewer behaviors. For convenience, we identified two granularities of collective viewing behavior: macro (patterns across items that are indicative of shifted viewing) and micro (patterns within a single item, or indicative of divided attention). Macro trends can help identify when, where or how viewers watch a content series; micro trends help in determining which screen they are focused on at any point of time, and why. We also indulge in a small but relevant debate on terminology – engagement and attention find interchangeable use in categorizing viewer presence in a show’s audience. However, in our context, attention is the specific act of ‘watching’ the content; engagement covers any activity with or about it. Attention implies engagement but the reverse does not always hold true. 
Rather, we see certain types of engagement data as ‘hints’ of user attention; one check-in can suggest attention but a flood of clustered social updates may speak more to engagement with the content than to sustained

For our purpose, we used Twitter for real-time data collection and Topsy for historical data retrieval. We intend to add GetGlue as our specialized network in the near future. However, for the initial analysis, we leveraged the fact that Twitter is a data sink for other networks; members link Twitter profiles and configure options to auto-post select activities or achievements to their Twitter stream. This makes Twitter fairly effective as a sensor for specific types of activities from those networks. As we show later, this is true for GetGlue ‘check-ins’ which represent the key data of interest to us.

² Twitter: http://www.twitter.com, http://dev.twitter.com
³ GetGlue: http://www.getglue.com, http://getglue.com/api
⁴ Topsy: http://www.topsy.com, http://otter.topsy.com

that show’s GetGlue URI to get a coarse-grained sense of user check-in counts at different slices. In effect, we made 6 API calls per show, which results in <1000 calls to process all shows in our master list. This is well under the API’s 7000 calls/day limit but note that this makes sense for post-hoc analysis; usage in real-time for fine-grained sensing will easily exceed that limit.

3. DATA COLLECTION

Our next step was to select target content items for data collection. We generated a master list of candidates from a mix of broadcast and cable channels with emphasis on primetime scheduling. Cable content was further pruned based on good Nielsen ratings or well-defined social presence, to maximize potential for social activity. This resulted in a total of 144 shows. For each, we added relevant metadata (e.g., Twitter account, hashtags, GetGlue URI, channel, title and description) with manual review for quality and accuracy.
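The call budget quoted above for the Topsy histogram queries can be checked with a few lines of arithmetic. The sketch below uses only the figures given in the text (144 shows, 6 calls per show, a 7000 calls/day limit); the real-time polling interval and the function names are illustrative assumptions, not part of the original study.

```python
# Figures taken from the text.
SHOWS = 144           # size of the master list
CALLS_PER_SHOW = 6    # 3 time windows x 2 queries per show
DAILY_LIMIT = 7000    # Topsy histogram API rate limit (calls/day)

SECONDS_PER_DAY = 24 * 60 * 60


def posthoc_calls(shows=SHOWS, calls_per_show=CALLS_PER_SHOW):
    """Total calls for a one-off, post-hoc pass over every show."""
    return shows * calls_per_show


def realtime_calls_per_day(shows, poll_interval_s, queries_per_show=2):
    """Calls per day if each show is polled repeatedly.

    poll_interval_s is a hypothetical polling period chosen for
    illustration; the text does not specify one.
    """
    polls_per_day = SECONDS_PER_DAY // poll_interval_s
    return shows * queries_per_show * polls_per_day


if __name__ == "__main__":
    print(posthoc_calls())                 # one pass over the master list
    print(realtime_calls_per_day(1, 30))   # one show, polled every 30 s
    print(realtime_calls_per_day(2, 30))   # two shows, polled every 30 s
```

Under these assumptions a single post-hoc pass needs 864 calls, while polling just two shows every 30 seconds would already exceed the daily limit, which is consistent with the caution expressed in the text about real-time use.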
3.2 The Twitter Dataset Twitter provides two APIs: streaming and search. For reasons mentioned earlier, the search interface is not immediately useful. Instead, ‘API’ refers only to the streaming version that exposes the Twitter firehose, a real-time stream of social activity updates (tweets) from all users. Different API endpoints provide different views into the stream; we use the filter endpoint, which returns a subset of the stream matching tracking criteria such as keywords or account names. In our case, we filter by the same keywords (hashtags) used earlier with Topsy, for each tracked show. Next, we set up data collection tasks at two granularities: macro (item level, across item occurrences) and micro (segment level, within the item). Our two data sources (Topsy and Twitter) each aligned best with one of these objectives. Topsy is ideal for macro analysis given the ability to retrieve historical trends going back weeks in time. Because Topsy derives its insights primarily from Twitter data, it can be queried using comparable parameters (e.g., hashtags). In principle, Twitter supports macro- (search API) and micro- (streaming API) data collection; in practice, search results are limited in size with limited configuration options. Results that are obtained by search will be relevant, but not deterministic or comprehensive enough for analysis of temporal behaviors. The API has some limitations. By default, this ‘free’ service will return only a representative subset (~1 percent) of the full firehose for requests. Further, that subset is selected (using an unknown relevance metric) to provide a fair distribution of results across all specified criteria. While Twitter allows up to 400 terms in filters, each additional term cannibalizes the share of results associated with pre-existing terms. 
To put it in context, a filter with terms for a single show is likely to be fairly comprehensive in covering that show; tracking all 144 shows concurrently results not just in poor coverage of any one show, but also incurs client side complexity by requiring a de-multiplexing component that segments that feed into show-specific streams, potentially in real time. 3.1 The Topsy Dataset While Topsy provides various data retrieval APIs, we used only its histogram endpoint to retrieve historical trends. The API is rate limited (to 7000 calls/day) and takes primarily three arguments: a slice (the desired sampling rate in seconds), period (the number of samples requested), and query (topic or keywords). The result is a single JSON object with request parameters (for reference) and an array of integers, each data point representing the number of tweets seen for that query term in that ‘slice’ of time. The size of the array matches the requested period, with data returned in reverse-chronological order; in other words, the first value refers to the slice ending at the request time. To simplify analysis, we convert each data point into a tuple consisting of that data value, and the related start-time for that slice; the latter is computed by progressively decrementing request time by ‘slice’ seconds. As a result, we decided to instead focus on tracking one show at a time, to get a sense of micro-behaviors. For the initial experiment, we deliberately selected four ‘signature’ events with a fairly high expectation of social engagement – two from sports (NCAA Final Four and Championship games) and two from highly-anticipated season premieres (HBO’s Game of Thrones, AMC’s Mad Men). In all cases, we collected data before, during and after the show, in order to also capture pre-show and post-show activity. Given the space constraints, we will focus on analysis of just one dataset from this collection. 
However, we plan to publish a more detailed report later that explores subtler nuances of cadence and content given the characteristic differences and half-life of these genres.

The service also provides an analytics dashboard⁵ that visualizes data as a simple time series and allows up to three queries to run in parallel to view comparative trends across them. It allows users to set slice and period values implicitly by selecting among four options: {past day, past week, past 2 weeks, past month}. We used this feature for quick insights as we describe later. However, we note that the dashboard uses the same histogram endpoint, making the data comparable to that retrieved from a direct call to the API, with a small difference in time offsets between the two requests. We also note that the API allows us to set the slice/period to any values of our choosing. For instance, {slice=30, period=120} will return a histogram for the past hour sampled at 30-second intervals. In theory this lets us retrieve micro-level data in real-time for shows; however, it offers no details on the content behind these counts.

For our analysis, we ran queries on the Topsy dashboard using the default 1-day, 1-week and 1-month settings per show. Moreover, we ran two concurrent queries for each show: the first focused on the keywords (hashtags) for the show to get a sense of overall user engagement, while the second scoped queries to tweets containing

4. PROCESSING WORKFLOW

Fig 1. Data Collection and Processing Workflow for Twitter

Post-collection processing of Topsy data was fairly minimal. Our only task was to create timestamped tuples corresponding to the distinct histogram data points. By contrast, the wealth of content and context in Twitter data called for some pre-processing to both segment the results into meaningful datasets, and to enhance them where useful for our analysis. Figure 1 illustrates the workflow at a high level.
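The timestamped tuples mentioned above for the Topsy histogram can be sketched as follows. This is a minimal illustration, assuming the histogram has already been parsed into a plain list of integers in reverse-chronological order, with the first value covering the slice that ends at the request time (as the text describes). The function name and the choice to return results in chronological order are assumptions made here for convenience.

```python
from datetime import datetime, timedelta, timezone


def histogram_to_tuples(counts, request_time, slice_seconds):
    """Convert a reverse-chronological histogram into (start_time, count) tuples.

    counts[0] is taken to be the slice ending at request_time, so the start
    of slice i is request_time minus (i + 1) slices.
    """
    step = timedelta(seconds=slice_seconds)
    tuples = [
        (request_time - (i + 1) * step, count)
        for i, count in enumerate(counts)
    ]
    tuples.reverse()  # oldest slice first (an assumption, for plotting)
    return tuples


if __name__ == "__main__":
    now = datetime(2012, 4, 2, 2, 0, tzinfo=timezone.utc)
    # {slice=30, period=120} covers the hour before the request.
    hour = histogram_to_tuples([0] * 120, now, 30)
    print(hour[0][0], hour[-1][0])
```

With slice=30 and period=120, the earliest tuple starts exactly one hour before the request time, matching the example given in the text.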
Twitter stream data arrives at a fairly rapid clip and is ingested and stored prior to processing. To explore hypotheses around subjective vs. objective behaviors, we annotate each tweet with sentiment polarity using Twitter Sentiment⁶, a bulk-classifier that returns {0=negative, 2=neutral, 4=positive} labels per tweet. The tweets were then fed to a ‘Slicer’ configured with a sampling rate, effectively segmenting the stream into chunks of a specified duration. Tweets in each chunk were analyzed for various markers (related to behaviors) and aggregate tweet counts (per behavior) were registered in a corresponding ‘transcript’ for that chunk.

⁵ Topsy Analytics Dashboard: http://analytics.topsy.com

tags that are invariably used in other, non-television contexts. This adds noise to the signal. Fig 2(b) shows histograms for the term “#house” (yellow), the equivalent GetGlue URI (red) and the term “#house and Fox” (blue) to provide contextual relevance; House airs on the Fox channel. Observe that even knowing scheduled slots for the program does not guarantee that tweets retrieved in that interval are related to the content.

We identify three broad types of behaviors: basic (that focuses on user activity types e.g., check-in, chat, comment), x-shifted (that is indicative of a shift from norm e.g., timezone bias in east vs. west coast viewers) and n-screen (that is indicative of divided attention between a first screen content and complementary or competitive second screen activity occurring in that interval). The last of these requires additional data retrieval and analysis, and is the focus of ongoing work; we mention it here only for completeness.
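The ‘Slicer’ and per-chunk transcripts described above might look roughly like the sketch below. It is not the authors’ implementation: the tweet dictionary layout (`timestamp` in epoch seconds, `text`, `sentiment`), the label names, and the `getglue` text marker used to flag check-ins are all hypothetical choices made for illustration. Only the sentiment codes {0, 2, 4} come from the text.

```python
from collections import Counter, defaultdict

# Labels returned by the bulk sentiment classifier, as listed in the text.
SENTIMENT_NAMES = {0: "negative", 2: "neutral", 4: "positive"}

# Hypothetical text marker for GetGlue check-ins; the real signature may differ.
CHECK_IN_MARKER = "getglue"


def classify(tweet):
    """Return the behaviour labels that one tweet contributes to its chunk."""
    labels = ["total"]
    if CHECK_IN_MARKER in tweet["text"].lower():
        labels.append("check_in")
    sentiment = SENTIMENT_NAMES.get(tweet.get("sentiment"))
    if sentiment is not None:
        labels.append("sentiment_" + sentiment)
    return labels


def slice_stream(tweets, slice_seconds):
    """Group tweets into fixed-duration chunks and count labels per chunk.

    Returns a chronologically sorted list of (chunk_start, Counter) pairs,
    where chunk_start is in epoch seconds. Chunks with no tweets are omitted.
    """
    transcripts = defaultdict(Counter)
    for tweet in tweets:
        chunk_start = tweet["timestamp"] - tweet["timestamp"] % slice_seconds
        transcripts[chunk_start].update(classify(tweet))
    return sorted(transcripts.items(), key=lambda item: item[0])


if __name__ == "__main__":
    sample = [
        {"timestamp": 100, "text": "I'm watching Game of Thrones @GetGlue", "sentiment": 2},
        {"timestamp": 110, "text": "winter is coming #gameofthrones", "sentiment": 4},
        {"timestamp": 135, "text": "ugh, spoilers", "sentiment": 0},
    ]
    for start, counts in slice_stream(sample, 30):
        print(start, dict(counts))
```

Keeping one counter per chunk means new behaviour markers can be added in `classify` without changing the slicing step, and the same stored tweets can be re-sliced at different sampling rates.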
For the present, our behavior classification is relatively naïve, using some mixture of natural language processing (e.g., we look for known text markers such as GetGlue’s distinctive check-in signatures) and Twitter’s rich set of semantic annotations (e.g., the utc-offset that is provided in streaming API results and indicative of timezone of that user). In a subsequent iteration, we plan to evolve analysis to incorporate machine learning and advanced statistical tools that can improve accuracy and expand breadth of identified features.

3. Sensor mash-up adds clarity. Correlating the activity around GetGlue (check-in) with that around a hashtag (tweet) can identify anomalies or increase clarity on behaviors. Fig 2(c) shows the check-in (red) and tweet (yellow) volumes around a popular wrestling show (WWE Raw); if we assume each check-in represents a user, we can clearly see a non-trivial ratio of tweets to users, indicative of an extremely engaged fan base. In a different example (Fig 2(d)), tweets (blue) and check-ins (red) are contrasted for ‘2 Broke Girls’ showing a low, steady tweet volume, with one sudden spike. By itself, this could be misconstrued as an unscheduled airing of a new episode; however, correlating this with check-in data (which clearly indicates a lack of viewers) highlights an anomaly; a check of tweets in that region indicates that the spike relates to a CBS announcement of show ‘renewals’ including this one. This is a clear example of engagement not equating to attention.

The behavior transcripts are analogous to a ‘social’ closed caption transcript for the content; they summarize various behavioral insights or contexts corresponding to that slice, and are associated with a timestamp indicative of its start time. The transcript format is also an artifact of our current usage requirements. In particular, we ‘visualize’ transcripts to look for unexpected characteristics, or evidence to validate (or disprove) different hypotheses.
For this, we use flot⁷, a JavaScript library that mandates a specific format for time-series data; transcripts are generated to meet this by default.

5. PRELIMINARY ANALYSIS

Fig 2. (a) #Alcatraz 1-month; (c) WWE-Raw 1-month

We now share some of the insights obtained from our initial analysis. In some cases, the insights are not conclusive, but rather hint at interesting behaviors that warrant further exploration. Due to space constraints, we also describe only a subset of the insights that we felt were of interest to the broader community. A detailed report on these, and other insights, is forthcoming.

We also found value in macro-analysis for understanding change in attention under schedule face-off (i.e., two shows in the same time slot, on competing networks), or regimen changes (e.g., series finale of show A could be traced to increased engagement in face-off show B the following week). Some of these insights were not conclusive and have been flagged for repeat experimentation.

5.1 Macro-Analysis With Topsy

The Topsy dataset was useful in studying broad trends and also in instrumenting parameters for the Twitter collection phase. A few of the insights obtained using the Topsy analytics dashboard:

1. Sample size. Large slice values expose periodicity in viewer engagement that can be directly correlated to episodic series. In Fig 2(a) we can easily identify the air-times for episodes of Alcatraz and even differentiate new episodes from repeats. However, a micro-analysis (1-day) is more sensitive to slice duration and shows jitter due to transient effects (e.g., a call-to-action during a specific episode driving up traffic).

2. Term disambiguation.
While Twitter advocates the use of clear ‘hashtags’ for each TV show for cohesive conversation, some shows (e.g., Unforgettable, House, The Office) have

Fig 2 (continued). (b) #House 1-month; (d) #2brokegirls 1-month

5.2 Micro-Analysis With Twitter

While Topsy data was useful in understanding macro behaviors for a show, it provided no details on the content under the counts. As a result, understanding nuances of engagement (including both x-shifting context and multi-screen behavior) is impossible. We plan to revisit macro-analysis with the Twitter streaming data in due course; for reasons mentioned earlier, we are constrained in the number of concurrent ‘captures’ we can perform with the API. Instead, in this section we focus on micro-analysis of the Twitter data stream around a single show: HBO’s Game of Thrones. The season premiere of the drama was highly-anticipated, with social engagement exceeding expectations [3]. Our filter used a few key terms, notably the dominant hashtag (#gameofthrones), a user-friendlier version (game of thrones), and some show-specific themes (e.g., ‘westeros’), which were first validated by the search API to guarantee a respectable volume of usage. The show aired simultaneously (at 9pm EST) on east and central time zones; its latest start-time was on the west coast (at 12am EST).

⁶ Twitter Sentiment API: https://sites.google.com/site/twittersentimenthelp
⁷ Flot Visualization Library: http://code.google.com/p/flot/

The Timezone Conundrum. We also have a third possibility. We segmented tweets by the ‘time-zone’ associated with that user’s profile. This is not the same as location, which reflects GPS or place coordinates for current location, and is inevitably undefined. Rather, profile location is at the city level and is always provided. We made the assumption that most viewers will watch content in their home time-zone; the data is visualized in Fig 4 below, with focus on the segment around first airtime.
To capture anticipation, we began collection 2 hours ahead of first broadcast (~7pm EST) and targeted a cut-off at 2 hours after last broadcast (~2am EST). In reality, we terminated the capture at 6am EST the next day. We captured ~65K tweets; the fraction of tweets within the broadcast window was relatively less than the ’60,000’ count reported by other sites; we attribute this to our accessing only the partial firehose, and limiting our collection only to Twitter data. Post capture, the data was processed using the workflow shown in Fig 1 earlier, with transcripts saved as files that could be input to the flot visualizer described earlier. All data visualizations shown used this tool with time represented in UTC format (on the x-axis) against tweet counts for the specified feature (on the y-axis). For convenience, we highlight the first broadcast time (1:00-2:00 am UTC, or 9-10 pm EST) as a column in a lighter shade of white. Some data (e.g., sentiment polarity) is not visualized here given space constraints; it may instead be discussed briefly in context. Fig 4. Sample size. The granularity of insights is a direct function of the ‘sample’ size we use in segmenting the Twitter dataset. Fig 3 shows the effect of different sample sizes (0.5s, 1s, 5s, 30s, 1min, 5min) on visualizing engagement in tweet counts; as anticipated, we have few but more distinct peaks and valleys at higher sample sizes (5 min), and more jitter at low sample intervals (0.5 sec). The former allows us to easily detect areas of high or low collective engagement, which could translate to areas of sustained high or low individual attention; intuition is that only an onscreen event of interest could create this near-synchronized spike in chatter across a large segment of the audience. This coherence is diluted at lower sample sizes. However, the tradeoff is in reaction times for mobile applications that want to ‘sense’ and respond to such engagement in real time. 
Peak-detection is reactive; the system can only report a peak after it computes the next sample count and finds it lower. A 5-minute lag in learning about, and reacting to, a peak-related action is unacceptable.

Fig 4. Segmenting tweet volumes based on timezone in user profiles

The results are counter-intuitive; total tweets are shown in yellow, east in green, central in blue, mountain in red and pacific in purple for comparison. By this count, east coast viewers have the second-lowest contribution despite being the first to see the show. And there is a noticeable presence from west coast viewers despite it being three hours ahead of their broadcast time; this was readily explained by observing that HBO (the host cable network) offers branded East and West channels available nationwide. In effect, a west coast viewer can watch content on east coast time by adding a subscription to HBO East. What makes this unique is the realization that west coast viewers are exhibiting a higher level of engagement than their peers by time-shifting consumption “forward”; in essence, they disrupted a normal routine (e.g., 6pm is a commute hour) just to be first in line for the show. The relative paucity in east coast contribution is puzzling and we need to dig deeper. One hypothesis is that 9pm EST (air-time) is ‘late’ for families, potentially encouraging time-shifted viewing; another is that our fundamental assumption (on validity of profile location) is flawed. Regardless, we perceive this to be an under-utilized attribute that can contribute to our better understanding of x-shifted viewing.

Check-ins: Noise or Signal? Earlier, we described how check-in data helps identify if tweets in an interval are more likely to relate to attention (watching) or engagement (other activity) for a show. We now observe that in Fig 4, the first and sharpest peak is due to a surge in check-ins as verified by the ‘signature’ pattern of tweet texts.
This correlates well with other reports [4] that indicate a marked user preference for ‘checking-in’ right before or after the targeted show. However, clustered check-ins boost tweet counts in a manner that creates a false peak of attention (within the show) and diminishes the signal from real attention events. Thus, in this particular episode, the first real peak of user attention is actually about 12 minutes in; however, when compared in magnitude to the check-in peak, it creates the erroneous impression of the event being ‘less’ significant. This implies the need for a signal separator that can isolate the check-in signals for use in boosting macro-level analysis without distorting the micro-level picture.
Identifying peaks and valleys at different sample rates
Viewing Behavior: Fig 3 also shows collective patterns of micro-level viewing behaviors. The first airing at 1:00-2:00am UTC (in the Eastern and Central time zones) produces the bulk of engagement data; the pattern repeats consistently for other shows. First comes pre-show anticipation (till just after 1:00), then in-show attention (till just before 2:00), and finally post-show reaction. Note the distinct but relatively smaller peaks at the 3:00 and 4:00 marks; these correspond to broadcasts in the Mountain and Pacific time zones, respectively. The reason for the markedly lower chatter is unclear; speculation runs along two lines that need to be validated or disproven. The first points to a ‘nothing left to say’ syndrome; later viewers routinely find their Twitter streams polluted by chatter (and spoilers) from earlier viewers and simply lose interest in participating. The second points to an ‘empty room’ effect; at that hour, a significant segment of the US population is turning in for the night, leaving fewer users online for sustained chatter.
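Returning to the check-in peak discussed above, one way to picture a signal separator is as a filter that routes tweets matching a check-in template into their own per-interval series, leaving the remaining tweets as the attention series. The Python sketch below is a minimal illustration under stated assumptions: the regular expressions are hypothetical stand-ins for real check-in templates (which would have to be derived from the captured data), and the sample tweets are invented.

```python
import re
from collections import Counter

# Hypothetical check-in signatures, assumed for illustration only.
CHECKIN_PATTERNS = [
    re.compile(r"@getglue\b", re.IGNORECASE),
    re.compile(r"\bchecked[\s-]?in\b", re.IGNORECASE),
]


def is_checkin(text):
    """True if the tweet text matches any assumed check-in signature."""
    return any(p.search(text) for p in CHECKIN_PATTERNS)


def separate_signals(tweets, bin_seconds):
    """Split (timestamp_seconds, text) pairs into two per-bin count series.

    Returns (checkin_counts, attention_counts), each a dict keyed by bin index.
    """
    checkins, attention = Counter(), Counter()
    for ts, text in tweets:
        bucket = int(ts // bin_seconds)
        target = checkins if is_checkin(text) else attention
        target[bucket] += 1
    return dict(checkins), dict(attention)


# Invented sample: two check-ins at the start, two reactions 12 minutes in.
sample = [
    (0, "I'm watching Game of Thrones @GetGlue"),
    (5, "Just checked-in to the premiere"),
    (720, "That scene!! #GoT"),
    (725, "did that just happen #gameofthrones"),
]
checkin_counts, attention_counts = separate_signals(sample, 60)
```

Keeping the two series separate means the check-in counts remain available for macro-level analysis, while peaks in the attention series are no longer dwarfed by the check-in surge.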
Rich academic research also exists in this context. A similar UK study [7] focused on using tweets to understand audience behavior in signature events (e.g., X Factor) as design guidance for related second screen apps. Prior work from our group [8] looked at the vertical domain of sports, and was the first to validate the utility of social sensing for gathering micro-level attention signals. But sports events have maximum impact when viewed live, and don’t suffer the extent of fragmented behaviors that other shows do; our current work seeks to extend that concept meaningfully to the larger television ecosystem. Finally, we hope to build on the extensive research in natural language processing techniques for classifying, summarizing and clustering tweets for various needs. For example, recent research on classifying collective attention [9] is relevant in that it potentially allows us to identify (and track) emergent topics around television shows, which could improve both our data collection and subsequent data filtering mechanisms and alleviate the known issues in term disambiguation and call-to-action pollution.
Word Cloud: For bulk-sentiment extraction, we had to produce a transcript containing just the text of all tweets, which we then also visualized as a word cloud using the popular Wordle service (Wordle Net: http://www.wordle.net). A particularly useful Wordle feature allows us to iteratively remove words from the visualized cloud to trigger a refresh in context. By removing anticipated terms (e.g., dominant hashtags) we can peel the layers of the onion to get new ‘hints’ of collective behaviors.
Fig 5. Game Of Thrones tweet ‘text’ visualized as a word cloud
In Fig 5 we show the word cloud generated from that transcript after we discarded our filter hashtags, and we can immediately see some patterns emerge.
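The iterative ‘peeling’ described above can be approximated offline with plain word counts. The following is a minimal Python sketch rather than a reproduction of the Wordle service; the tokenizer, the `removed` parameter and the sample transcript are assumptions made for illustration.

```python
import re
from collections import Counter

# Keep a leading @ or # so mentions and hashtags stay distinct tokens.
TOKEN = re.compile(r"[@#]?\w+")


def word_frequencies(transcript, removed=()):
    """Count lower-cased tokens, skipping any term listed in `removed`."""
    skip = {w.lower() for w in removed}
    tokens = (t.lower() for t in TOKEN.findall(transcript))
    return Counter(t for t in tokens if t not in skip)


# Invented transcript text, not taken from the captured dataset.
transcript = (
    "RT @HBO: #GoT tonight! I'm watching #GoT @GetGlue. "
    "RT if excited for #GoT"
)

# First pass: the filter hashtag dominates the counts.
first_pass = word_frequencies(transcript)

# Second pass: remove the anticipated term to expose the next layer.
second_pass = word_frequencies(transcript, removed=["#GoT"])
```

Each pass plays the role of one refresh of the cloud: once the dominant hashtag is removed, the next most frequent term in this toy transcript is the retweet marker.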
The terms {@GetGlue, watching, checkedin} all relate to the check-in template used by GetGlue; the size of those words shows that a sizeable subset of tweets came from GetGlue, and this is borne out by news reports [3]. Next, we notice the dominance of “RT” (retweet), an activity that takes no cognitive effort from users but boosts overall traffic for that show. Tweet analysis showed that a significant number of these retweets were from @HBO (another dominant word) stoking anticipation for the premiere. This is in line with Twitter’s own guidelines for driving engagement; however, for ‘attention’ profiling, the RT can add noise. For instance, some ‘call-to-action’ tweets asked users to ‘RT and be entered into a drawing to win’ merchandise; many obliged, adding noise to the attention signal. To tackle such conflicts of interest, we plan to integrate pattern-matching and machine-learning mechanisms similar to spam detection, to cancel or bias the metrics impact of such data in an evolving manner.
6. RELATED WORK
All these factors call for continuous monitoring and understanding of best-practices and real-world patterns of social activity around television to more effectively select, amplify or attenuate signals of interest to second screen applications in specific contexts.
7. CONCLUSION
In this paper, we described early efforts to collect and analyze social activity data from television viewers, and to determine if there was sufficient quality and quantity of ‘signal’ to create differentiated attention and engagement metrics for content. We shared early results that showed ways to boost relevant signals, filter spurious noise and cancel the negative metrics impacts of otherwise valid data. The process also helped us identify processing flaws and intriguing results that warrant further study and will be the focus of our next iteration on this research.
[2] GetGlue Blog. “Analyzing Social Television: Checkins vs. Nielsen”. http://bit.ly/getglue-nielsen
8.
REFERENCES
[1] Nielsen Wire. “40% of Tablet and Smartphone owners use them while watching TV”. http://bit.ly/nielsen-wire-multitask
The domination of Twitter as the digital watercooler for television viewers is not coincidental. More than any other network, Twitter has proactively advocated and supported “best practices” [3, 4] for increasing viewer engagement with television. Since the report, numerous social analytics startups (e.g., Bluefin Labs, Trendrr and SocialGuide) have built on the data to deliver richer social dashboards that complement Nielsen engagement ratings.
However, our goal is somewhat different. First, these leverage the full firehose, along with proprietary paid access to different backend services; we wanted to understand if a best-effort solution based on a partial firehose could have utility. Second, they currently focus on macro-insights (item-level); we are more interested in micro-insights (segment-level) for modeling and understanding fragmented attention and audience behaviors. Third, our ultimate goal is to support application development by abstracting and presenting collective behaviors as ‘hints’ that can be used to develop or enhance a second screen experience. In that context, we believe that advocated best-practices (e.g., add call-to-action hashtags to boost engagement) may be detrimental to our needs; it takes microseconds for a user to ‘RT’ or ‘meme’ tweets in the name of engagement, but the resulting collective noise may be easily misinterpreted as signal if not detected and screened out.
[3] Mashable. “Game of Thrones Premiere crashes GetGlue; gets 60,000 comments”. http://on.mash.to/mashable-got-premiere
[4] Miso Blog. Most Miso Users Check in When Shows Start. http://bit.ly/miso-checkins-dist
[5] Twitter Developers. Twitter on TV: A Producer’s Guide. https://dev.twitter.com/media/twitter-tv
[6] Twitter Blog. Watching Together: Twitter and TV. http://bit.ly/twitter-and-tv
[7] Lochrie, M. and Coulton P., 2012.
Tweeting with the telly on: mobile phones as second screen for TV. Proceedings of the IEEE Consumer Communications and Networking Conf. (January 14 - 17, 2012). CCNC '12. IEEE. Las Vegas, NV.
[8] Zhao, S., Zhong, L., Wickramasuriya J., and Vasudevan V. Human as real-time sensors of social and physical events: A case study of twitter and sports games. arXiv preprint, 2011. http://arxiv.org/abs/1106.4300
[9] Lehmann J., Goncalves B., Ramasco J. J. and Cattuto C. Proceedings of the 21st International Conference on World Wide Web. WWW ’12. ACM. New York, NY.