ACM CHI is the most prestigious venue in the field of Human-Computer Interaction. Due to the travel bans and other restrictions during the COVID-19 era, CHI 2021 was online.
Six PETLabbers (Allesandro, Gabriela, James, Kavous, Lahari, and Pooja) attended ACM CHI 2021, where they watched many presentations, attended Late-Breaking Work (LBW) sessions, and met other researchers to make new friends and find research collaborators. In this blog post, each member shares their impressions of CHI and the most exciting talk(s) they attended.
I have been told for years that CHI was THE conference in HCI, and that publishing there was the ultimate goal for researchers. Thus, I was thrilled to be able to participate; even though it was held online, I expected it to be a conference like no other. Well, I was somewhat surprised by both the good and the bad compared to other conferences, and in the end I could not attend all the sessions I had planned to. Overall, the conference remained impressive in both the quantity and the quality of the research. Interestingly, I found myself more interested in the poster sessions than in the regular ones.
My research revolves around encouraging people to increase or maintain a certain level of physical activity through social support. Thus, I tried to select sessions and papers that addressed either physical activity or agency. There is definitely a growing trend in the use of conversational agents, but sports and physical activity were less well represented in the selected papers. One of the first talks I attended was by Ashktorab and her colleagues from IBM Research AI, who analysed the effect of communication directionality in human-AI interactions. While I do not necessarily intend to go down the conversational-agent path, I was looking for information on how humans and agents communicate and interact. This paper, as I will detail below, gave me interesting questions and hints to include in my research.
Ashktorab et al. Effects of Communication Directionality and AI Agent Differences in Human-AI Interaction
In this paper, the authors explore the effects of communication directionality and responsiveness on social perception in human-AI interaction. Past research on social perception has established that how we perceive our peers is essential for collaboration. The authors therefore set out to measure and understand the social perception people form when exposed to an AI-driven conversational agent (chatbot). Their setting involved the user and the chatbot in a text-based game in which one player had to guess a word known to the other. Both players took on the role of the guesser and that of the “giver”, as defined by the researchers. Thus, the human had to manage the hints she provided to the chatbot, since its interpretation skills might be limited.
The game they used is called Guess the Word, a collaborative text-based game. Each player takes on the role of either guesser or giver. The guesser has to guess a specific word that the giver knows but cannot say. The giver has to provide hints so that the guesser can find the word. In this study, at most 10 hint words could be provided; past this limit, the game is lost. The interesting part is how the human user interacts with the AI in order to give it the right hints. I personally asked myself: how do we interpret agents and their capabilities? Do we naturally limit what we say, because of a biased idea that the agent is only intelligent to a certain extent? Or, on the contrary, do we assume we can say anything because the machine knows everything?
In their study, the researchers built three different models, each based on a different technique. The first model (Model A) was trained using three different data sources: the “Free Association Norms”, a dataset of associated words (for example, the first word that comes to mind when given “artificial” could be “intelligence”); word embeddings, a technique in which words are represented as numeric vectors; and WordNet, one of the best-known lexical databases. Model A is based on a Gradient Boosting Machine, which is a supervised learning technique. For additional details on how it was trained, I suggest you look at their paper. The second model (Model B) relies on the same data but uses a different technique: Reinforcement Learning. In this case, the system learns by itself to make the choices that maximise a reward. More interestingly, they used the agent-agent self-play approach, as some of the best-known RL-based game agents do (for example AlphaGo, AlphaStar, or OpenAI Five). The third and last model (Model C) was designed using a data-driven approach and the Forward Association Strength technique. Here, the idea is to train the system on associations in which one word is proposed to a person, who answers with an associated word. It might seem quite similar to Model A; however, it is quite different, since it relies on specific formulas linking a cue (the word given by the system) and a target (the word provided in response). For example, when the chatbot was playing as the giver, it would look at the probability of a known target (the word the user is trying to guess) being associated with a cue, and would likely provide that cue as a hint. Thus, this kind of approach requires a lot of user input for training compared to the other two.
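To make the cue/target idea more concrete, here is a minimal sketch of how a giver could pick a hint from forward association strengths. It is not the authors' implementation: the tiny `RESPONSES` dataset and the function names `forward_strengths` and `best_hint` are made up for illustration, and the strength is simply estimated as the share of people who answered a given target for a given cue.

```python
from collections import Counter, defaultdict

# Hypothetical free-association responses: (cue, first word a person answered).
RESPONSES = [
    ("artificial", "intelligence"),
    ("artificial", "intelligence"),
    ("artificial", "fake"),
    ("smart", "intelligence"),
    ("smart", "clever"),
    ("smart", "clever"),
    ("smart", "clever"),
]


def forward_strengths(responses):
    """Return {cue: {target: P(target | cue)}} estimated from response counts."""
    counts = defaultdict(Counter)
    for cue, target in responses:
        counts[cue][target] += 1
    return {
        cue: {t: n / sum(c.values()) for t, n in c.items()}
        for cue, c in counts.items()
    }


def best_hint(strengths, secret_word, already_given=()):
    """Pick the unused cue most likely to make a guesser answer secret_word."""
    candidates = {
        cue: targets[secret_word]
        for cue, targets in strengths.items()
        if secret_word in targets and cue not in already_given
    }
    return max(candidates, key=candidates.get) if candidates else None


strengths = forward_strengths(RESPONSES)
print(best_hint(strengths, "intelligence"))
```

With this toy data, “artificial” leads to “intelligence” in two out of three answers, whereas “smart” does so in only one out of four, so the giver would offer “artificial” first and fall back on “smart” once that hint has been used.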
In their experiments, the participants (N=199) were each assigned to a particular model (the ones described earlier, labelled A, B, and C) and were told that they were playing against either a human opponent or an AI one. The main metrics used to capture social perception were intelligence, rapport (e.g. did the opponent seem engaged or not), and likeability. Additionally, the authors asked participants whether they felt they were playing against a human or an AI player. The results show that participants adopted different communication strategies when interacting with an entity perceived as a human than with one perceived as an AI. This led them to play differently and to optimise the hints they gave, for example by providing the opposite of the previous word. Interestingly, with one model (A), users were likely to find it more intelligent when they perceived it as an AI and it was the one guessing the word. Hence, the directionality (giver or guesser) had an impact on the perception of the opponent. In most cases where the AI played the role of giver, users reported lower social perception on all the metrics used, especially when they perceived it as an AI opponent. The researchers also suggest that the feeling of being in control (as the giver) may have been more comfortable for users, and that it changed their social perception of their opponent. The authors end their article with a reminder that the social perception of an agent cannot be evaluated properly when users are presented with only one type of agent, or with an agent that behaves in exactly the same way all the time. Additionally, they emphasise that the context in which the collaboration takes place is a determining factor of social perception.
For my research, this provides many ideas and much information on how to approach human-AI collaboration and interaction in general. As mentioned before, I do not intend to create conversational agents for now; however, creating agents with different behaviours could be an interesting path. Depending on the behaviour of the agent, would we react differently to the social support it provides? If we know that there is a machine behind it, do we necessarily think that it can outperform us? Should it? These are some of the questions that arose in my mind while reading this article and following the presentation.
Among all the selected papers on behavior change and persuasive technology, Odalapo and colleagues’ piece was the one that caught my attention the most. Not only because the topic is right in line with my research interests, but primarily because, from reading the title, I could infer that they followed a methodology similar to the one I use in my research. I always find it interesting to see how other researchers explore approaches and theories parallel to mine. In this case, they designed a persuasive system based on the Transtheoretical Model, or Stages of Change. How is this similar to my research? Well, I design persuasive systems based on Self-Determination Theory. Let’s explore what these researchers found.
Odalapo et al. Tailoring Persuasive and Behaviour Change Systems Based on Stages of Change and Motivation
This research had several objectives. First, the authors wanted to explore how individuals perceive persuasive strategies at various stages of change. Next, they used the ARCS motivation model to understand why the persuasive strategies they selected motivate behavior change. They found that the stage of change in which individuals situate themselves plays a significant role in the perceived persuasiveness of different strategies, and that these strategies motivate for different reasons.
To better understand this research article, we first need to be aware that the Transtheoretical Model of Behavior Change states that individuals progress through six stages to adopt a health behavior (i.e., precontemplation, contemplation, preparation, action, maintenance, and termination).
Second, the authors employed persuasive strategies drawn from the Persuasive System Design model. There are 28 strategies in total, but the authors selected the top 5 based on their popularity in research articles (i.e., self-monitoring, reminder, suggestion, social role, praise).
Third, we need to know which constructs are used in the ARCS Model of Motivation: attention, relevance, confidence, and satisfaction.
Once these concepts are clear we can continue to the methodological part of the article. First, the authors designed a high-fidelity prototype that includes the previously selected persuasive strategies. Then they conducted a large-scale online survey to elicit participants’ responses concerning the effectiveness and motivational appeal using standardized questionnaires.
Their most interesting result is that self-monitoring motivates behavior change for people at different stages of change through various mechanisms. The authors support their findings with quotes taken from the comments participants entered as part of their questionnaire answers. Finally, they conclude the article by providing design guidelines for persuasive system design.
Overall, I enjoyed this paper. However, a major limitation is that the results are based on participants’ subjective accounts of how they think the persuasive strategies would affect them. This is questionable, as people do not always do what they say they will do (Sayette et al., 2008). It would be interesting to see how these perceptions change when participants are given a fully functional system instead of a high-fidelity prototype. The authors mention that they will do this in future work. I will definitely keep an eye on their research, with high expectations of insightful results to come.
 Sayette, M. A., Loewenstein, G., Griffin, K. M., & Black, J. J. (2008). Exploring the cold-to-hot empathy gap in smokers. Psychological science, 19(9), 926-932.
While it is my first time attending CHI, like many, I hope this will be my last in a virtual setting. In addition to the solid academic contributions of this seminal event each year, it is also a vital calendar entry for anyone in the HCI field for its opportunities to ‘mingle’ with like-minded researchers and situate oneself in the bustle of breakthrough thinking. So CHI feels very different in a pandemic world, no less rewarding but an entirely different experience to navigate.
For me, it was exciting to see some relevant work explicitly in the domain of my research. Self-reflection is often adjacent to other topics like well-being and wellness, but it is rarely the principal subject. At CHI this year, two works in particular captured my attention; each makes significant contributions and offers insights in a space that is integral to my PhD journey.
Bentvelzen et al. ‘The Development and Validation of the Technology-Supported Reflection Inventory’
One discovery in my research is that, although self-reflection is often mentioned or alluded to, most of the work on it comes from outside the HCI ‘bubble’. Self-reflection is generally measured with instruments from other disciplines, such as psychology. It is then up to the researcher to factor in or consider the technological element.
Bentvelzen and colleagues describe developing a Technology-Supported Reflection Inventory (TSRI), which aims to build on existing measures and to offer greater utility as a tool for comparing technological artefacts and prototypes. One of the instruments they started with is the Self-Reflection and Insight Scale (SRIS), a tool that I have already used in my research. So, the idea of an extension within HCI is intriguing.
Their description includes the stages of development, testing and validation that went into the inventory. Specifically, they had experts review a list of scale items that appear in existing toolsets. They then used factor analysis to reduce this list to a final total of nine items. The result is three questions in each of three areas: insight, exploration and comparison. The questions are intuitive queries about how a test artefact makes the person feel or behave, with answers on a 7-point Likert scale (e.g. from strongly disagree to strongly agree).
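As a rough illustration of how answers to such an inventory could be summarised, here is a minimal sketch that averages nine 7-point answers into one score per area. The item order (items 1-3 for insight, 4-6 for exploration, 7-9 for comparison) and the use of a mean rather than a sum are assumptions made for this example, not details taken from the paper, and `score_inventory` is a hypothetical helper name.

```python
from statistics import mean

# Assumption: items 1-3, 4-6 and 7-9 map to the three areas in this order.
SUBSCALES = {
    "insight": (0, 1, 2),
    "exploration": (3, 4, 5),
    "comparison": (6, 7, 8),
}


def score_inventory(answers):
    """Average nine 1-7 Likert answers into one score per area."""
    if len(answers) != 9:
        raise ValueError("expected nine answers")
    if any(not 1 <= a <= 7 for a in answers):
        raise ValueError("answers must be on a 1-7 scale")
    return {name: mean(answers[i] for i in idx) for name, idx in SUBSCALES.items()}


# One participant's (made-up) answers after trying a prototype.
scores = score_inventory([6, 7, 5, 4, 4, 4, 2, 3, 1])
print(scores)
```

Scoring each prototype iteration this way would make it easy to compare, area by area, how much reflection each version supports.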
For my research, the TSRI may be a valuable tool for developing an adaptive system for self-reflection. A key difference to the regular SRIS scale is that the result of Bentvelzen’s inventory is indicative of the reflective change from a particular system. It is not a measure of a person’s baseline reflective stance/temperature. The TSRI could be invaluable for testing iterations of a prototype intervention (especially given its brevity, it is well suited for usability testing). Participants would likely need to be tested on the SRIS to produce a more holistic understanding of their temperaments.
Pieritz et al. ‘Personalised Recommendations in Mental Health Apps: The Impact of Autonomy and Data Sharing’
Personalised health care is a buzzword in computing circles. Emerging technologies may offer opportunities to engage with users at a new level of individuality. I am fascinated by this concept in my research and share the feeling with others that there may be essential thresholds in tailoring and personalisation. Most of us can appreciate where these kinds of things can tread into unsettling territory. One such example is the personalised advertising space and data collection that is a highly prolific topic within HCI, especially in recent years. For technology to get closer to an individual, it needs to collect information about its users directly or take an intuitive approach. Human beings approach one another similarly – when you make a new friend, for example. However, what technology enables is a tremendous scale of collection and intuition that are beyond the capacities of an individual human.
Pieritz and colleagues discuss the recent growth in interventions that target mental well-being. Many of these have experimented with personalised recommendations to improve their efficacy and appeal. Still, habit formation is always, on some level, autonomous. Pieritz and colleagues wanted to compare users’ engagement with systems designed for autonomy (where the user has more freedom to choose what to try, etc.) against those centred on personalised guidance. In addition, they looked at specific user preferences when it came to sharing the data that enables this personalisation in the first place. My work will hopefully lead to an adaptive and personalised system, so these observations are crucial for understanding user dynamics and preferences before I build on my ideas. This kind of work can be invaluable for saving time and improving the likelihood that something will be helpful, because it incorporates prior findings.
Pieritz and colleagues used a mental well-being application called Foundations as the basis of their work, with each study group having a different user experience. These experiences were personalised to a greater or lesser degree (e.g. a view that listed all the available activities vs one that showed a particular recommendation). The suggestions were built from a personality type inferred through device sensors or a questionnaire. In total, there were five groups (one of which was a control group), and the remaining four were:
For example, Data-Guided presented personalised recommendations based on the user’s inferred personality type from continuous data collection. Questionnaire-Guided provided personalised suggestions based on the personality type that participants reported during the discrete questionnaire.
Amongst the findings, two points are fascinating. Before group assignment, people self-reported a stronger preference for questionnaires than for allowing the app to collect data. Still, this did not have an impact on app usage in practice: regardless of how the personality was inferred, the user experience remained similar. Building upon this, although participants rated personalised experiences as most preferable beforehand, in actual use mixtures of autonomy and personalisation saw the most engagement.
I take a couple of important lessons from this piece for my work. The first is that users appear to find onboarding through questionnaires preferable. Taking a more technologically sophisticated approach may not be worth the additional investment: users in this study did not seem to have more or less success when they took a one-off questionnaire instead of granting continued access to sensor data. Second, in line with my observations, personalisation does have its role to play, and user autonomy is not an alternative to it but an ingredient in the equation. In line with Bentvelzen’s work discussed earlier, exploration seems to help users feel confident enough to take the initiative some of the time and sufficiently comfortable to accept assistance at other times. Viewed analogously, the rapport in a non-technological therapeutic setting is similar. A psychological professional balances a patient’s cognitive exploration and offers timely prompts when kindling or support is needed. Pieritz and their colleagues provide evidence of a similar process between a therapeutic interface and its user. This transactional equilibrium is vital to engagement. Giving users absolute freedom to choose activities may seem like a good idea. Still, it places all responsibility for generating competency on the user, and a fully personalised approach does the opposite. With a mixture, however, the user’s needs are more evenly satisfied, and this encourages a virtuous cycle of engagement.
Kavous Salehzadeh Niksirat
Even though I was attending my third CHI, ACM CHI 2021 was my first experience of a virtual CHI. Perhaps, in the coming years, we will have a more sustainable and safer world, so that we can again fly across continents, meet new people at CHI, and have little adventures during the busy CHI days. But I believe that, even if that happens, we should keep organizing conferences, and CHI in particular, in a “hybrid” format to support people with disabilities, students, and academics who lack funding or face visa restrictions.
Among many interesting presentations I attended, I selected two studies that are relevant to my research:
Park and Lee. Designing a Conversational Agent for Sexual Assault Survivors: Defining Burden of Self-Disclosure and Envisioning Survivor-Centered Solutions
This is a design work! The authors studied how to design conversational agents (CAs) to support survivors of sexual assault. Sexual violence is a serious issue for many women around the world, who can be assaulted by their partners or by non-partners. Unfortunately, survivors face difficulties after the crime, and many cannot report their cases to other people or to the authorities. Usually, they avoid face-to-face disclosure, as it carries many other burdens. In most cases, survivors face secondary victimization; in other words, people or authorities lay the blame on the victims. Considering these issues, using a machine instead of a human could be an alternative for reporting and exchanging the necessary information.
In this paper, the authors first defined survivors’ self-disclosure burdens, then compared the burdens caused by a CA and by a human, and finally co-designed features of CAs using the participatory design method. During the participatory design sessions, participants talked about their earlier experiences with human agents and their requirements for CAs. Interestingly, the authors found that 17 survivors preferred to report their cases to CAs rather than to police officers.
The authors identified different self-disclosure burdens! The survivors expressed fewer burdens with CAs than with humans in terms of time, financial aspects, availability, and emotional burdens (such as blame). They also mentioned several burdens of reporting their case to a CA, for example, privacy and security burdens (e.g., they worry that other people might find out they are using such apps), social burdens (e.g., they worry about social pressure if other people see the notifications from the app), and emotional burdens (e.g., they think CAs merely imitate and fake empathy).
The authors mentioned several features that can reduce such burdens such as multi-stage authentication, locking the past conversations, using the “export to email” system, and camouflaging the app icon. To reduce the emotional burden and avoid the problem of faking empathy, the authors suggested the use of crowdsourced messages collected from other users. The authors also suggested that such an application should empower the survivors. For example, the legal procedures should only proceed if the users confirm them.
I was interested in this paper because it targets the type of users who are vulnerable to violence. It also studies CA design. This, in particular, is interesting for our group as we also wonder how CAs can support social media users to reduce privacy conflicts. Stay tuned to know more about our ongoing research topics!
Rakibul et al. Your Photo is so Funny that I don’t Mind Violating Your Privacy by Sharing it: Effects of Individual Humor Styles on Online Photo-sharing Behaviors
The authors studied the humor styles of online social network users and their relationship with photo-sharing behaviors. An interesting finding was that when the authors primed users with warnings to discourage meme sharing, some users shared even more than before! The authors later categorized the users into different types and called people who showed this behavior “humor deniers”. The paper suggests that designers should personalize privacy-preserving interventions (e.g., warnings) based on the user type, as such warnings might backfire for some users.
I was interested in this work because it is about the problem of Multiparty Privacy Conflicts (MPCs), where meme sharers (data uploaders) can threaten the privacy of the people depicted in memes (data subjects). In our recent work in the PET and ISP Labs, we designed a new family of solutions called Dissuasive Mechanisms to deter uploaders from sharing others’ photos in online social networks without consent. It would be interesting to see how effective our dissuasive mechanisms are in the meme-sharing context and how users with different characteristics, such as humor deniers, respond to them.
In addition to the papers mentioned above, I also liked several other papers, such as (i) a study about the role of Fear Of Missing Out (FOMO) in human privacy behavior, in particular in online social networks, (ii) a design work that proposed a new technique for password entry using muscle memory, (iii) a survey study about the effect of nudging (i.e., default and framing techniques) on the privacy decisions of smart-home users, (iv) an empirical study about the security and privacy advice given to protestors during the #Black_Lives_Matter protests, and (v) a study that used empathetic communication skills to develop chatbots that protect users against online financial fraud.
During CHI, I also met several people from different institutions that could lead to further information exchanges or collaborations.
I was really looking forward to attending CHI 2021 because it would be my first exposure to the premier conference and CHI community. This year, since CHI was a virtual event, it allowed non-authors (like myself) to participate in the conference and gain some valuable insights.
My research lies at the intersection of collaborative learning and learning programming. I also focus on the use of computational notebooks to teach programming. Amongst all the papers listed at CHI 2021, I could not find anything on collaborative learning that interested me, but two starkly different papers caught my attention: one on instruction for programming and one on the use of computational notebooks by experts. The reason is that I could situate my research interests at the intersection of these two papers. Coincidentally, both papers shared the same first author. This also led to an interesting conversation with the author about his work in both domains and its future directions.
Weinmann et al. Improving Instruction of Programming Patterns with Faded Parsons Problems
Before we begin, let us look at what programming patterns are. Programming patterns are reusable, high-level code abstractions that accomplish a goal while solving a programming task. Knowing how to recognize and apply programming patterns to solve problems is critical when learning to program.
In this paper, the authors explore how to teach programming patterns effectively and incorporate them into Computer Science (CS) curricula. When students are given a programming problem to solve, they can find valid solutions in a variety of ways. A specific exercise on tree traversal, for example, can be done using nested for loops as well as recursion. Instructions in the question can be used to specifically motivate students to practice the recursion pattern. However, such questions are still very open-ended and do not compel students to practice the relevant programming patterns in particular.
The authors developed an interface to support three different types of Python programming practice: code writing, code tracing, and Faded Parsons Problems. In code-writing exercises, students were given a question prompt and had to construct a valid program, whereas in code-tracing exercises, they were given programs and had to find the output for certain input values. Faded Parsons Problems are a variation of Parsons Problems, a format of objective assessment used in teaching programming. In Faded Parsons Problems, students also had a problem prompt to solve, but this time they were given partially complete lines of code in shuffled order. They had to rearrange the lines of code and complete each line in order to solve the problem.
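To give a feel for the exercise format described above, here is a minimal sketch of a Faded Parsons Problem and a checker for it. It is only an illustration under my own assumptions, not the authors' interface: the `total` exercise, the `____` blank marker, and the helpers `shuffled`, `assemble` and `passes` are all made up for this example.

```python
import random

# Hypothetical exercise: three lines contain a blank ("____") the student must fill in.
FADED_LINES = [
    "def total(values):",
    "    result = ____",
    "    for v in ____:",
    "        result += v",
    "    return ____",
]


def shuffled(lines, seed=0):
    """Return the lines in a scrambled order, as the student would first see them."""
    rng = random.Random(seed)
    mixed = list(lines)
    rng.shuffle(mixed)
    return mixed


def assemble(ordered_lines, fills):
    """Fill each blank, in order of appearance, and join the lines into a program."""
    remaining = iter(fills)
    completed = [
        line.replace("____", next(remaining)) if "____" in line else line
        for line in ordered_lines
    ]
    return "\n".join(completed)


def passes(program):
    """Run the assembled program and check it against one test case."""
    namespace = {}
    try:
        exec(program, namespace)
        return namespace["total"]([1, 2, 3]) == 6
    except Exception:
        return False


# A student who restores the right order and fills the blanks correctly.
attempt = assemble(FADED_LINES, ["0", "values", "result"])
print(passes(attempt))
```

The student has to do two things at once here: restore the order of the lines and decide what goes in each blank, which is what distinguishes the faded variant from a plain reordering exercise.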
The researchers conducted a classroom study with 237 CS1 students to determine which of these exercise types were effective at teaching programming patterns and at providing practice for eventually writing code, and to explore students’ preferences regarding Faded Parsons Problems. The findings revealed that Faded Parsons Problems and code tracing were more likely to expose students to programming patterns than code-writing exercises. For transfer to code writing, Faded Parsons Problems and code writing were the more effective formats, but only in Faded Parsons Problems were students required to actively construct parts of the final program. The students also preferred Faded Parsons Problems, as they found it easy to construct elegant solutions to their problems.
This paper piqued my interest because it introduced me to Faded Parsons Problems and how to incorporate them into programming practice exercises. While Faded Parsons Problems are an intriguing tool for teaching patterns effectively, I’m curious to see how they can be used to facilitate metacognitive reflection in programming problem solving. Part of my current project focuses on exploring metacognition in programming problem solving (stay tuned for it over here), and this work has good potential to shape future iterations of my project.
Weinmann et al. Fork It: Supporting Stateful Alternatives in Computational Notebooks
This study aims to support exploration in computational notebooks, which are frequently used by data scientists in their tasks. The authors focus on the fact that computational notebooks provide a single execution state for manipulating variables and tasks during problem exploration, and on how this limits experts’ exploration methods.
The researchers identified the problems encountered through formative interviews with six data scientists and formulated the following design principles to support multiple execution states in computational notebooks: Express Alternatives, Manipulate Execution, Visualise Alternatives.
They designed and implemented a tool for notebooks called ‘Fork It’ that introduced two features: forking, which creates new interpreter sessions, and backtracking, which navigates through previous states. With forking, data scientists could fork a cell, splitting it into multiple paths. This creates a new kernel for each path, and the paths run independently of each other. The backtracking feature focuses on determining which interpreter session to use for forking. When users backtrack after executing a cell, they see the code from the last run cell at the current point in history, the values of all variables at that point, and the option to visit the execution history of the executed code cell. They can select a particular state they want to explore from the history and choose to create a fork at that point.
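The idea of forking from a past state can be illustrated with a small toy model. This is only a sketch under simplifying assumptions, not how ‘Fork It’ is implemented: the real tool creates a separate kernel per path, whereas the hypothetical `Session` class below just deep-copies a dictionary of variables and runs cells with Python's built-in `exec`.

```python
import copy


class Session:
    """A toy interpreter session: a variable namespace plus a history of snapshots."""

    def __init__(self, variables=None):
        self.variables = dict(variables or {})
        self.history = []  # one (code, snapshot of variables) entry per executed cell

    def run(self, code):
        """Execute a cell and record the state it produced."""
        exec(code, {}, self.variables)
        self.history.append((code, copy.deepcopy(self.variables)))

    def fork(self, step=None):
        """Start an independent session from the current state or from a past step."""
        state = self.variables if step is None else self.history[step][1]
        return Session(copy.deepcopy(state))


main = Session()
main.run("data = [3, 1, 2]")
main.run("data.sort()")

# Backtrack to the state after the first cell and explore an alternative there.
branch = main.fork(step=0)
branch.run("data.reverse()")

print(main.variables["data"])    # [1, 2, 3]
print(branch.variables["data"])  # [2, 1, 3]
```

Because each fork gets its own deep copy of the variables, changes made in one path never leak into the other, which is the property that lets alternatives be compared side by side.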
A qualitative evaluation of the tool with 11 data scientists on a model prediction task revealed their behaviour patterns when using these features. The experts used the two features to compare decisions, parallelise workflows, isolate messy exploratory code, debug, and undo executions.
Even though ‘Fork It’ is aimed at expert use-cases and has little to do with my field of study, it sparks discussion about how exploration can be encouraged and better supported in computational notebooks. Exploration can also be used as a method to help novice programmers learn and probe different programming concepts. Tools like ‘Fork It’ pave the way for future research into expanding the features of computational notebooks, particularly for use in the educational domain, which is something I would like to explore in my own research.
ACM CHI 2021 was special to me, as this was my first time at the premier Human-Computer Interaction conference. It is the most prestigious conference in the world of HCI and was my opportunity to meet and network with like-minded researchers. As this was a virtual event and I had no prior experience with this monumental conference, I was unsure what to expect. The first day began in confusion: difficulty navigating the platform and lag in the live videos and Q&A did not help. However, as the day progressed, I got the hang of things, and my experience improved. I want to thank the ACM organising committee for making this event happen in these challenging conditions.
As my interests lie at the intersection of AI and HCI, many of the sessions I attended leaned towards this. I saw a lot of excitement about AI, with papers presented on various aspects such as conversational agents or chatbots, explainability and fairness in AI, interpretability, and Natural Language Processing. Among the many presentations I attended, I describe below two that align with my research interests.
Yang & Aurisicchio. Designing Conversational Agents: A Self-Determination Theory Approach.
Self-Determination Theory (SDT) is a broad theoretical framework used to study human motivation and development. SDT is a primary focus in our lab, and it was exciting to see it applied to the design of conversational agents (CAs). SDT assumes that humans have an innate tendency to grow, master challenges and seek new experiences, which drives motivated behaviour. This study considers the Basic Psychological Need Theory, one of SDT’s six mini-theories, which posits that fulfilling the needs for competence, autonomy and relatedness will motivate the self to seek optimal function and growth. The authors use it to understand the underlying needs users must have met to achieve positive experiences with CAs, and to support CA design. They chose SDT because prior literature has shown it to be effective at facilitating the design process.
In the paper, the authors describe their work in two phases. In the first phase, they conduct interviews to understand how CAs fulfil or hinder the three basic needs (competence, autonomy and relatedness) and how to design for these needs. In the second phase, they derive ten actionable guidelines from the findings of the first phase. The guidelines are grounded in human needs and are intended to bring psychological benefits to CA users.
The study’s findings on users’ perceptions and expectations of each of the three needs, and on the aspects of a CA that may support or hinder them, are very insightful. The results reveal that users’ perception of competence was often influenced by how fully they believed they had used the CA’s capabilities and how well they communicated with it. Users’ sense of autonomy frequently showed in being in charge of the conversation, in controlling how their data is handled, and in the CA being personalised. The authors also anticipate how a CA can affect relatedness as a communication tool and as an interface to communities. Six guidelines were developed from implications relevant to competence and the other four from implications relevant to autonomy. Relatedness was not taken into account in creating the guidelines, as it can be particularly product-dependent. However, the authors note that the ten guidelines can support all three needs to varying degrees.
The findings from this paper are interesting to my research. I had an insightful discussion with the first author of this paper about the motivation for using SDT and future directions. These guidelines can benefit the design of automated systems in our projects. We are interested in understanding the effect these automated systems or agents can have on all three needs. Keep an eye out for updates!
Zytek et al. Sibyl: Explaining Machine Learning Models for High-Stakes Decision Making.
There has been growing evidence that a lack of transparency in complex systems might negatively influence users’ trust in AI choices. This lack of confidence can also degrade the overall user experience. The resurgent scientific field of explainable AI (XAI) explores solutions to this dilemma. One of XAI’s goals is to develop novel explanation algorithms that promise to provide new insights into state-of-the-art black-box machine learning models, allowing users to better understand and trust AI systems. This is also of interest to me, personally.
In this work, the authors examine the use of an ML model in a qualitative domain involving high-stakes decision-making: screening maltreatment referrals to child welfare services. They propose and develop Sibyl, a machine learning explanation dashboard designed to let people without technical expertise work with algorithms when making decisions.
The authors conducted a study with a pool of 19 social workers and supervisors working for the child welfare department. They ran a simulated case review session using the ML model. The collaborating screeners were asked to make decisions about historical referrals as they usually would, but with the addition of ML model scores. This helped the authors understand the existing child welfare screening workflow and identify the explanation needs present in it. They also interviewed screeners about what type of extra information they would want to see alongside ML predictions. Based on these findings, they designed Sibyl with several explanation features to aid decision-making. The main explanation interfaces are case-specific local feature contributions, what-if explanations, global feature importance showing the model’s general logic, and feature distributions among past cases.
A user study conducted with Sibyl revealed mixed results. Although the explanation interface was found to be helpful to both experts and non-experts, the explanations were transparent but more model-friendly than user-friendly. The explanations were reported to both increase and decrease perceived user trust. This study is a step in the right direction towards explainable and interpretable models in high-risk situations, and it demonstrates that there is a lot of scope for improvement in the field.
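For readers unfamiliar with these explanation types, the sketch below shows what three of them compute in the simplest possible setting. It assumes a plain linear scoring model, which may not match the model used in the paper, and every feature name, weight and value is made up for illustration. It is not Sibyl’s code.

```python
# Hypothetical features, weights and baseline (average) values, invented for this sketch.
WEIGHTS = {"prior_referrals": 0.5, "child_age": -0.25, "household_size": -1.0}
BASELINE = {"prior_referrals": 2.0, "child_age": 8.0, "household_size": 2.0}
INTERCEPT = 5.0  # score of a case that sits exactly at the baseline


def local_contributions(case):
    """Local feature contributions: how far each feature pushes this case's score
    above or below the baseline score."""
    return {f: WEIGHTS[f] * (case[f] - BASELINE[f]) for f in WEIGHTS}


def score(case):
    """Model score = baseline score plus the sum of the local contributions."""
    return INTERCEPT + sum(local_contributions(case).values())


def what_if(case, feature, new_value):
    """What-if explanation: the score the same case would get with one feature changed."""
    changed = dict(case)
    changed[feature] = new_value
    return score(changed)


def global_importance(cases):
    """Global feature importance: average absolute contribution across many cases."""
    totals = {f: 0.0 for f in WEIGHTS}
    for case in cases:
        for f, contribution in local_contributions(case).items():
            totals[f] += abs(contribution)
    return {f: total / len(cases) for f, total in totals.items()}


case = {"prior_referrals": 4.0, "child_age": 4.0, "household_size": 1.0}
contributions = local_contributions(case)
original_score = score(case)
changed_score = what_if(case, "prior_referrals", 2.0)
importance = global_importance([case, BASELINE])
```

A local contribution answers "why did this case get this score?", a what-if answers "what would change the score?", and global importance summarises which features matter most to the model overall.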