Triggering Artwork Swaps for Live Animation

– How about now? Better, maybe? Okay (laughs). Thank you for the introduction. Recent advances in
performance-based animation, combined with the increase
of live streaming platforms, have given rise to a new form
of animation where performers control animated characters
for live audiences. One example is a live “Simpsons” episode, which occurred this past May, in which Homer spent three minutes responding to phone-in
questions from viewers. To better understand the
needs of live animators, we interviewed the authors of that episode and several other authors
of performed animations. These include the production team for “The Late Show with Stephen Colbert”, whose animators have worked on roughly 50 live animation performances ranging in length from 30
seconds to six minutes. Another case is the author
of a 30-minute question and answer session on YouTube Live
featuring a cartoon monster. The characters used
in these performances are represented by a set of artwork layers that typically depict different
body parts or accessories. During animations, the layers can either
deform or be swapped. Swapping artwork layers
causes discrete transitions and produces large changes
in pose and appearance. For example, gestures such
as shrugging, pointing, and fist shaking are often portrayed as alternative artwork
layers for arms and hands. All of the interviewed animators
use triggered artwork swaps extensively in their performances. To trigger these artwork swaps, the animators that we interviewed
use keyboard shortcuts. However, they found it difficult
to remember the mapping from keys or buttons
used to trigger the swaps to the corresponding artwork. In fact, “The Simpsons” and
“The Late Show” used the annotated keyboards shown here and
dedicated one performer to operating the triggers while
others handled the speaking. Based on the discussion with animators, we identified the following design goals for a live animation triggering interface: the mapping between triggers
and the corresponding artwork should be intuitive so that performers do not have to memorize a
large number of triggers; the triggering interaction
itself should be accurate and predictable to minimize mistakes; the interaction should also
be fast to help animators coordinate swaps with the
rest of their performances; to cover the broadest range of use cases, the system should be accessible and not require highly
specialized input devices or multiple performers
for a single character. An important high-level
design decision for our system is what type of input device to support. We considered several options. Keyboards are very accessible, but the mapping between the keys and the triggers is not intuitive, as noted when interviewing the animators. While physical annotations can help, reconfiguring them for
different characters with different sets of triggers or even normal keyboard
usage is inconvenient. Game controllers are
designed to be responsive, but, as with keyboards, the mapping between input interactions and triggers is not intuitive. Realtime tracking of hand and body poses is another potential input system. Video-based techniques
are the most accessible, but even state-of-the-art algorithms are not accurate enough
for our application. Techniques that rely on
depth data are more accurate, but depth cameras are still less common
than regular cameras. Finally, we decided on
a multi-touch interface for our approach. Another way to explore
triggering interface designs is by looking at previous
work on performed animation and multi-touch triggering. In addition, work on predictive user
interfaces is useful for predicting
subsequent triggers. Previous performance-based
animation systems explore a variety of techniques for
capturing human performances, including: motion capture,
puppetry with physical props, and direct manipulation via touch. In contrast, we focus on the task of triggering discrete artwork changes typically performed manually
via keyboard shortcuts. We propose a new multi-touch
interface that is more effective and configurable
than keyboard triggering. Previous work has explored
multi-touch triggering interfaces in other domains. Given that text input is a
form of discrete triggering, our problem is related to the design of soft keyboards for touch devices. However, the problem of
arranging animation triggers presents unique challenges. First, the set of animation triggers varies from one character to the next. In addition, the triggers
should be near the position of the corresponding triggered artwork, which puts additional spatial
constraints on the layout. In this respect, our approach is related to label layout algorithms
that encourage labels to appear close to the
corresponding anchor regions. Our interface leverages an
animator’s practice sessions to learn a probabilistic model that predicts the next
trigger during a performance. Previous work on predictive
or adaptive user interfaces includes: app or icon selection
on mobile interfaces, menu navigation, and command selection. We investigate the use
of predictive models in the specific task of
live animation triggering. From these ideas, we design and evaluate
a multi-touch interface for triggering artwork swaps
in a live animation setting. Our approach leverages two key insights. First, to help users execute
swaps more efficiently, our interface arranges visual triggers that show thumbnail images
of the corresponding artwork around a live preview of the character. This trigger layout design enables users to quickly recognize and tap triggers without looking away from the character. Second, since animators typically practice before live performances, we encode common patterns
from practice sessions in a predictive model that we then use to highlight suggested
triggers during performances. Our highlighted predictions
assist performers in executing common sequences without
preventing them from improvising. Our triggering interface
uses a multi-touch display, which provides a reconfigurable
interactive display surface where we can render the triggers. Our prototype uses
Adobe Character Animator as the realtime rendering
engine for the character, which allows us to support face tracking and audio-driven lip sync, in addition to the core
triggering functionality of our system. Character Animator triggers
artwork swaps with the keyboard. Our system passes keyboard
signals to Character Animator to trigger the
corresponding artwork swaps. Now I will demo our system. Give me a second to set it up. Here you can see a live
preview of the character, which moves when I move, and the mouth is driven by lip sync from the audio that I’m saying. Our system picks up touch input, which you can see by the yellow lines drawn on the interface. Our system has three
different types of triggers. Blue triggers trigger a
single swap when I press down, or a cycle swap. Orange triggers switch
to different sub poses based on a radial slice
around the main artwork. The orange triggers switch
to different sub poses based on the number of fingers
that I have pressed down. Notice I have three fingers pressed down and the character also
holds up three fingers. Besides the performance mode, there is also the layout design. In each of our
layout designs, the triggers are placed in layout slots that are close to their original artwork. Our goal is to assign each
trigger to a unique slot to minimize the cost function, which is defined as the
distance between the center of the trigger and the
center of the artwork. We have corresponding left and
right triggers in symmetric slots around the characters
and triggers that correspond to both hands or arms are
placed below the character. We have a row layout, which has left and right triggers in corresponding grids
around the character, and we also have a fan layout
where the triggers are placed in a natural fan around the
character to mimic the natural resting position of the
user’s fingers on the display. To prepare for a live performance, the actor can practice the
script a couple of times and then save each practice. We then leverage these practices to make the triggering easier
during the live performance. Now I’m going to practice a
script and the script appears at the top of the screen
in this teleprompter. “Hi, everyone. My name is Furiosa. I am super excited to be here at UIST today. I know that most of you flew in to join the conference today, but let me try to convince you to try this new teleporter watch. That’s how I arrived in Quebec City. You’re unsure that it
actually works on humans and not just animated characters? Well, I assure you that this
is the new method of travel for the future.” Now that we’re done practicing,
we can do a performance. Our system builds a Markov
model in order to help the user during the performance time. Our model predicts and
highlights the most likely subsequent poses whenever
the user hits a trigger. An animator can use our system with either no suggestions highlighted,
one suggestion highlighted, or three suggestions highlighted. The triggers which are
grayed are not suggested, but the user can feel
free to hit them anyway. The other triggers are colored
and scaled based on their estimated probability
of being the next state. The bigger and brighter
triggers are most likely. Now, let’s perform the
script that I just practiced. “Hi, everyone. My name is Furiosa. I am super excited to
be here at UIST today. I know that most of you flew in to join the conference today, but let me try to convince you to try this new teleporter watch.” Notice that when the practice matches the performance quite closely, the suggestions make it much easier to identify and hit the next trigger. But, I can also improvise
like I’m doing now, and the system is fine. But, the user still has control over the exact timing of the performance, allowing me to pause
and talk to all of you. Now I’m going to finish the script. “That’s how I arrived in Quebec City. You’re unsure that it
actually works on humans and not just animated characters? Well, I assure you that this
is the new method of travel for the future.” Now I’m going to describe
how our system works. (audience applause) For more details about
the layout optimization and our design choices behind it, which are based on a pilot user study, please see our paper. Now let’s talk about the predictive model. To determine which triggers to highlight during the performance, we create a Markov
model from the practices to predict the next triggers. Each practice session is recorded as a sequence of trigger states. When modeling the transitions
between the individual states, a single state does not
provide enough context to give useful predictions. Rather, we must consider
sequences of states or n-grams where n is the length of the sequence. Here, we show examples
of 2-grams and 3-grams. We don’t know which size n will work best for any given performance. N equals one does not
capture enough context, but if n is too large, the system is not robust
to improvisation or mistakes. Our solution is to construct
an ensemble of Markov models, each of which uses a different size of n-gram. At performance time, given
the current trigger state, we compute a weighted combination of the transition
probabilities from every model to determine the most likely next states. We experimented with four
different weighting schemes that set the weight for each Markov model, as shown here. The first favors Markov models that have more training data
that match the current state. The second weights all the models equally. The third gives decreasing
weight to longer n-grams, and the fourth model
gives increasing weight to longer n-grams. To determine the proper weight and maximum n-gram value to use, we ran a model validation. We ran several experiments in order to evaluate our
predictive triggering model, determine the maximum n-gram value, and choose an appropriate weighting scheme for our ensemble of Markov models. To obtain ground-truth performances, we manually recorded the
sequence of trigger poses for eight appearances of cartoon
Trump and three appearances of cartoon Hillary on “The Late
Show with Stephen Colbert”. Please see our paper for
details on our training and testing sets as well
as our error calculation. From the results, we determined
that weight type four, which gives increasing
weight to longer n-grams, shown in the darkest colors, and a maximum n-gram value of eight offers the best behavior for both the
tests of Trump, shown in red, and those of Hillary, shown in blue. In addition to the quantitative evaluation of our predictive triggering model, we conducted a user study
with 16 participants comparing four triggering
interfaces: no suggestions, one suggestion, three suggestions, and the baseline keyboard
with icon stickers. Each study session consisted of two parts: a practice and a performance. During the practice period, participants rehearsed their
script or responses four times using the keyboard and four
times using the interface with no highlight suggestions. We used these four
practices with our interface to train our predictive triggering model. We then asked participants
to practice the same script with all four of our interface
conditions to familiarize themselves with the
appearance of suggestions. After the practice period, the participants performed
four variations of the script in order of increasing difficulty, using each interface once
in a counterbalanced order. We collected the rankings of
the interfaces from the users, as displayed in this histogram. Users showed a clear
preference for our interface over the baseline keyboard condition. However, some users preferred no suggestions while others liked the suggestions. To better understand how our system addresses professional workloads, we conducted an informal demo session with the production
team at “The Late Show”. We asked them to use our
system to animate cartoon Trump answering a question from a
previous episode of the show. Here is the result. Oh, wait. Where’s the audio? – Glad you asked. First, I gold plate the
entire city of Cleveland, including the people. Then, I ride in on a chariot pulled by showgirls
dressed like Lady Liberty, and unlike the real statue,
these girls are 10s. It’s a total ground session. Then, I take my throne and
announce my vice president, Optimus Prime. Together, we will transform
America to be great again. Roll credits. – The production team gave us lots of positive
feedback on our system, as well as interesting
suggestions for future work. In addition to the functionality
that we currently support, some participants in our study suggested other trigger types. For example, it may be
useful to have a trigger that enables continuous
transformation of the artwork via direct manipulation. Another interesting direction to explore is how to leverage context
from spoken dialogue to improve the predictive
triggering model. For example, we can
incorporate audio features captured during practice
sessions into a model, then during a performance, the audio could serve as an additional cue to refine the suggested triggers. As noted in the evaluation
of our predictive model and our user study, the quality of suggestions decreases when the performance deviates
from the practice sessions. One way to improve the utility
of our system in such a case is to develop a more sophisticated model that can better handle
diverse training data. In conclusion, we believe
that live performed animation represents an interesting
new application domain for the HCI research community. We hope that our work inspires others to investigate the unique
challenges and opportunities that arise from this emerging medium. We want to thank “The Late
Show with Stephen Colbert” for the helpful feedback and
the use of their characters. Thank you for your attention. (audience applause) – [Rob] I’m Rob Miller from MIT CSAIL. It wasn’t quite clear to me, but the original animations actually had multiple puppeteers. They had one person doing the voice and one or more people actually
controlling the animation. Is that true? – In the production setting
they usually had one person acting out the voice to a
script and then just one other person on the keyboard
doing the different triggers. – [Rob] When you talked to those teams and showed them what you had done, did you get a sense that they
wanted to then reduce that to just one person doing both
the voice and the animation, or was there still value for them in having multiple puppeteers? – There was still value
for them having two people mainly because the voice actor
was very skilled in doing different voices and the
person doing the triggering would be more familiar with
the actions of the character, so even with this sort of system they would keep those two roles separate. However, the system would
make it easier to switch out the person doing the
triggering, as opposed to requiring someone with more practice and more
experience with that. – [Fengyuan] Hi. My name is Fengyuan Li from
University of Michigan. Fascinating work. – Thanks. – [Fengyuan] I saw you did
a subjective rating survey on the predictive
modeling, like the highlights. I wonder if it was effective
for the user, and if so, I wonder whether it would work against someone who wants to improvise. – Right, so users can still
improvise with our system. The triggers are grayed out,
but you can still press them. Users found that the
suggestions worked really well when they stuck with the
practice that they did. So if what they were practicing matched what they were
doing at performance time, they found it worked really well. Sometimes they found the three suggestions a little distracting
if they deviated a lot. – [Steve] Steve Feiner,
Columbia University. I really like the fact that
you were working with people who actually use these systems
in the original versions. One interesting trade-off
between things that have the flexibility of a
touch sensitive flat panel is that comparing that with physical keys, you can have your fingers
on the physical keys, you can feel even without
looking where your fingers are, which is something that
could really be advantageous to someone who is doing this a lot. I was wondering if you had
any comments from the folks you were working with about
the trade-offs between this system where it was exactly
where you might want it to be on the screen relative to the character, versus being able to go and
have your fingers on the ready to go and quickly maneuver
between the different keys. – Right, so from the feedback
from “The Late Show”, the professional team, they had no problem not having
the physical keys there. They actually preferred
seeing the live preview of the character right next
to where they’re pressing to get better feedback and response. From the user study, some of
the users did mention that they really like the tactile
feedback of pressing the keys. I think maybe one of the
reasons behind that is that, in the user study, participants found that three
keys worked fine for them and they only used that, whereas the professional
animators really like to emote more in their performance
and use the wide range. – [Steve] Thank you.

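The layout goal stated in the talk (assign each trigger to a unique slot so that the total distance between trigger centers and artwork centers is minimized) can also be sketched in Python. The talk does not say how the optimization is solved and refers to the paper for details, so the brute-force search below is only an assumption that is practical for a handful of triggers. The name `assign_slots` and the coordinates are hypothetical, and the symmetry constraints for left and right triggers are not modeled.

```python
from itertools import permutations
from math import dist


def assign_slots(artwork_centers, slot_centers):
    """Assign each trigger to a distinct slot, minimizing total distance.

    artwork_centers: dict of trigger name -> (x, y) center of its artwork.
    slot_centers: list of (x, y) centers of the available layout slots.
    Returns (dict of trigger name -> slot index, total cost).
    """
    names = list(artwork_centers)
    if len(names) > len(slot_centers):
        raise ValueError("need at least one slot per trigger")
    best_cost, best_perm = float("inf"), None
    # Try every way of giving the triggers distinct slots (fine for small sets).
    for perm in permutations(range(len(slot_centers)), len(names)):
        cost = sum(
            dist(artwork_centers[name], slot_centers[slot])
            for name, slot in zip(names, perm)
        )
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return dict(zip(names, best_perm)), best_cost


if __name__ == "__main__":
    artwork = {"left_hand": (-2.0, 0.0), "right_hand": (2.0, 0.0)}
    slots = [(3.0, 0.0), (-3.0, 0.0), (0.0, -3.0)]
    print(assign_slots(artwork, slots))
```

For larger trigger sets, an assignment solver such as the Hungarian algorithm would avoid the factorial search; brute force is used here only to keep the sketch dependency-free.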