I. Paper Overview

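Released: posted to arXiv on February 9, 2023
Venue: accepted to CHI 2023, a top conference in human-computer interaction
Title: "Zeno: An Interactive Framework for Behavioral Evaluation of Machine Learning".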
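- The paper looks at the systematic failures that machine learning models can show in real-world use, such as harmful biases and safety issues. It frames behavioral evaluation as a way to discover and mitigate these failures by inspecting a model's outputs for specific inputs.
- From interviews with 18 machine learning practitioners, the authors found that behavioral evaluation is a collaborative, use-case-first process that existing tools do not adequately support.
- Based on these findings they designed Zeno, a general-purpose framework for visualizing and testing AI systems across diverse use cases. In four case studies, practitioners were able to reproduce their earlier manual analyses and to discover new systematic failures.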
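To make "inspecting a model's outputs for specific inputs" concrete, here is a minimal sketch of slice-based evaluation in plain Python. It is not Zeno's API: the records, the slice definitions, and the function names are all made up for illustration. The point it tries to show is that a single aggregate accuracy can hide a subgroup on which the model fails completely.

```python
# Hypothetical toy data: each record holds an input text, its true label,
# and a model prediction (1 = positive sentiment, 0 = negative sentiment).
RECORDS = [
    {"text": "good", "label": 1, "pred": 1},
    {"text": "bad", "label": 0, "pred": 0},
    {"text": "not bad", "label": 1, "pred": 0},
    {"text": "I did not dislike the film at all", "label": 1, "pred": 0},
    {"text": "The plot was dull and the acting was worse", "label": 0, "pred": 0},
    {"text": "A wonderful, moving story from start to finish", "label": 1, "pred": 1},
]

# Hypothetical slices: a name plus a predicate that selects matching records.
SLICES = {
    "all": lambda r: True,
    "short text": lambda r: len(r["text"].split()) <= 3,
    "contains 'not'": lambda r: "not" in r["text"].lower().split(),
}


def accuracy(rows):
    """Fraction of rows whose prediction equals the label (None if empty)."""
    if not rows:
        return None
    return sum(r["label"] == r["pred"] for r in rows) / len(rows)


def evaluate_slices(rows, slices):
    """Compute accuracy separately for every named slice."""
    return {
        name: accuracy([r for r in rows if keep(r)])
        for name, keep in slices.items()
    }


report = evaluate_slices(RECORDS, SLICES)
for name, value in report.items():
    print(f"{name}: {value:.2f}")
```

With this made-up data the overall accuracy looks acceptable, while the slice of inputs containing "not" is wrong every time, which is exactly the kind of systematic failure an aggregate metric would not reveal.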

I am still a beginner and cannot yet add much analysis of my own, so for now the goal is to build up background knowledge.

II. Background, Corpus, and Terminology Notes

1. Terminology

Term	Explanation	Use in the paper
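toy example	In a paper, a "toy example" is a simplified, small-scale example used to explain or illustrate the authors' point, method, or model. It is usually a reduced version of the real problem that is easier to understand and analyze.
CCS concepts	CCS stands for the ACM Computing Classification System, the taxonomy of computing topics maintained by the ACM (Association for Computing Machinery). Authors submitting to ACM venues choose the CCS concepts that match their paper so that readers can see at a glance what it covers. The concepts act as a classification index and are required in papers published by ACM.	The paper lists its CCS concepts just before the keywords: Human-centered computing → Interactive systems and tools; Computing methodologies → Machine learning; Artificial intelligence.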
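behavioral evaluation	In this paper, behavioral evaluation means discovering and mitigating a model's systematic failures by inspecting its outputs for specific inputs, in addition to computing an aggregate metric. For a sentiment classifier, typical checks are whether it handles double negatives, is invariant to gender, and is accurate on short text. The term here does not refer to analyzing or predicting human behavior.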
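metamorphic testing	Metamorphic testing is a software testing technique that checks relations between the outputs of multiple executions of a program, which helps with the test oracle problem. Instead of comparing each output against an expected output, it checks that the outputs satisfy properties the program should preserve. This is useful when testers cannot easily determine the expected output or judge whether a given output is correct. The core idea is the metamorphic relation: when an input is changed in a known way, the output should change, or stay the same, in a predictable way. The technique applies to many kinds of software, including sequential programs, parallel programs, and distributed systems.	Another common method for behavioral evaluation is metamorphic testing, a concept from software engineering that involves checking the outputs of a blackbox system for inputs that are perturbed in a specific way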
snowball sampling	Snowball sampling is a non-probability sampling technique, commonly used to study populations that are hidden or hard to reach, such as drug users or sex workers. Existing participants help recruit further participants, so the sample grows like a rolling snowball. Its advantages are better access to hard-to-reach groups and savings in time and cost. Its drawbacks include limited representativeness and sampling bias, so these limitations should be kept in mind when selecting and analyzing the sample.	The initial participants were recruited through posts on social media networks, e.g., Reddit, LinkedIn, and Discord, and through direct contacts at technology companies. Additional participants were then recruited through snowball sampling.
qualitative analysis	Analysis based on non-numerical data, such as interviews and observations
quantitative analysis	Analysis based on numerical data and measurement
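dogfooding	Dogfooding is IT-industry slang for using the products or services that your own company or team builds. It helps developers understand their product, find its problems and shortcomings, and fix them promptly, and it can also serve as a form of quality control. The term's popularity is often traced to Microsoft, where Paul Maritz sent an email in 1988 urging employees to "eat their own dog food", that is, to use the company's products. The practice is especially common in software development.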
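end-user	An end user is the person who ultimately uses, or is intended to use, a product, typically a consumer. End users are distinct from the people who support or maintain the product, such as system administrators, database administrators, IT experts, software professionals, and computer technicians. The role has shifted from the 1950s, when end users did not interact with the mainframe directly (computer experts programmed and ran it), to the 2010s, when end users work with management information systems and IT departments and advise on what they need from a system or product. End users matter in product development because design should center on their needs and expectations. When a company's own staff act as end users of the product they build, the practice is called dogfooding.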
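trade-off	A trade-off is the act of weighing two or more options against each other, usually giving up something to gain something else. In AI the term often refers to the following. Performance-accuracy trade-off: improving performance (such as speed or efficiency) may reduce accuracy, and vice versa. Bias-variance trade-off: a model's bias and variance pull against each other and must be balanced to get the best model. Explainability-accuracy trade-off: improving accuracy may reduce how explainable the system is, and vice versa. AI papers often discuss these trade-offs and propose ways to handle them.	For example, P16’s management team often makes decisions based solely on a high F1 score, while it is often the case that different clients require different trade-offs between precision and recall.
model-agnostic	A model-agnostic method can be applied to any machine learning model, regardless of its type or complexity. Such methods aim for flexibility and compatibility across models, so researchers and practitioners can apply them to a wide range of problems. They are most often used for interpretability, that is, for understanding and explaining how a model makes predictions or decisions. Examples include partial dependence plots, permutation feature importance, and SHAP values. Model-agnostic methods are especially useful when comparing different models or when working with complex models that are hard to interpret.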
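The metamorphic testing entry in the table above can be sketched in a few lines of Python. Everything here is hypothetical: `toy_sentiment` is a stand-in keyword model with a planted flaw, and `swap_name` is one possible perturbation (swapping a proper noun). The sketch only illustrates the general idea of checking that an output stays invariant under a perturbation, without needing to know the correct label.

```python
import re

# Hypothetical keyword lists for a stand-in sentiment model.
# Planted flaw: "joy" is a positive word, so the name "Joy" shifts the score.
POSITIVE = {"good", "great", "joy"}
NEGATIVE = {"bad", "awful"}


def toy_sentiment(text):
    """Toy model: count positive and negative keywords in lowercase tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum((t in POSITIVE) - (t in NEGATIVE) for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def swap_name(text, old="Joy", new="Mark"):
    """Perturbation: replace one proper noun with another."""
    return re.sub(rf"\b{re.escape(old)}\b", new, text)


def invariance_failures(model, inputs, perturb):
    """Return cases where the model output changes under the perturbation."""
    failures = []
    for original in inputs:
        perturbed = perturb(original)
        before, after = model(original), model(perturbed)
        if before != after:
            failures.append((original, perturbed, before, after))
    return failures


INPUTS = [
    "Joy said the food was awful",
    "Joy thinks the service is great",
    "The soup was bad",
]

failures = invariance_failures(toy_sentiment, INPUTS, swap_name)
for original, perturbed, before, after in failures:
    print(f"{original!r} -> {before}, {perturbed!r} -> {after}")
```

Note that no ground-truth labels appear anywhere: the check only compares the model with itself, which is why this style of testing sidesteps the test oracle problem.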

2. Vocabulary

Word	Meaning	Sentence from the paper
stakeholder	a person or group with an interest in the outcome	Enumerating what behaviors a model should have or what types of errors it could produce requires collaboration between stakeholders such as ML engineers, designers, and domain experts.
canonical	standard; widely accepted as the norm	Evaluating a machine learning model is the challenge of understanding how well a model can accomplish a given task. The canonical approach to evaluation is to calculate an aggregate performance metric on a held-out sample of data or test set.
end up	to finally be in a situation or doing something	Practitioners end up manually testing hand-picked examples from users and stakeholders, making it challenging to effectively compare models and pick the best version to deploy
invariant	unchanged under a given transformation	For example, a practitioner creating a sentiment classification model might check that the model works for double negatives, is invariant to gender, and is accurate for short text. In addition to aggregate metrics, they would check how their model performs in these specific scenarios.
perturb	to change slightly; to disturb	Checklist is a metamorphic testing tool for NLP models that perturbs text inputs, for example, switching proper nouns and testing if a model’s output switches. Zeno enables users to do slice-based and metamorphic testing for any domain and task.
blindspot	an area that goes unnoticed or is handled poorly	Algorithmic methods are a common approach for detecting groups of instances with high error, often termed “blindspots”.
codify	to arrange into a systematic set of rules or principles	Complex models also require robust reporting methods to ensure that information about data and models is recorded and preserved. Datasheets for Datasets, FactSheets, Nutritional Labels, and Model Cards codified the first principles for documenting ML details for future use and reproducibility.
recruit	to enlist people to take part	The initial participants were recruited through posts on social media networks, e.g., Reddit, LinkedIn, and Discord, and through direct contacts at technology companies. Additional participants were then recruited through snowball sampling. (This word comes up often in papers.)
probe	to test or explore in order to find something out	A common technique 11 of the 18 participants mentioned was creating their own data inputs to probe a model and find potential failures, often called “dogfooding” in software development.
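systematic	methodical; affecting the whole system instead of isolated cases
typo	a typing error
standardized	made to conform to a standard
prototype	an early working model of a product
agile	nimble; in software, iterative and incremental development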
stochastic	random; involving randomness	Although updating a model can improve the overall performance of an ML system, it can also lead to new failures. This is especially true for stochastic models, such as deep learning, which cannot be deterministically updated.
deterministic	fully determined, with no randomness	Although updating a model can improve the overall performance of an ML system, it can also lead to new failures. This is especially true for stochastic models, such as deep learning, which cannot be deterministically updated.
fragment	a broken-off piece; as a verb, to break into pieces	However, since many model evaluations are run inconsistently and across different tools, the history of past performance is often fragmented or lost, making it difficult to find regressions or new failures.
span	to extend across; the extent covered	Modern machine learning development in practice is a collaborative effort that spans different teams and roles
one-off	done or made only once
the drop-down menus	UI controls that reveal a list of options when clicked
slope	the steepness of a line; the rate of change	For each slice, Zeno fits a simple linear regression of the selected metric across models, and users are alerted of slices with significant negative slope by a downward arrow next to the sparkline
asynchronous	not happening at the same time; not blocking
impairment	a loss or reduction of a physical or mental function	The model aims to make UIs more accessible to people with visual impairments by informing them of the type of interface they are looking at.
tablet	a tablet computer
laptop	a portable notebook computer
misclassify	to assign to the wrong class
leakage	the unintended escape of something, such as information	They also make an effort to write evaluation code that is distinct from the training code to ensure that they avoid any bugs such as data leakage in the training process
intersectional	lying at the intersection of several categories or features	Due to their high quantity of metadata, the participant only looks at simple slices of data, and does not often explore intersectional slices of multiple features.
amplitude	the magnitude of a wave or oscillation
intuitive	easy to understand without explanation
niche	specialized; familiar only to a small group	Lastly, the participant reflected on how usable Zeno would be for everyday users of algorithmic systems. They mentioned that technical terms such as “metadata” may be too niche for everyday users and could be renamed. Otherwise, they found the system intuitive and usable if set up for use by diverse end users.
malignant	cancerous; harmful	malignant tumor detection
parallel computation and caching	running work concurrently and storing results for reuse
bottleneck	the limiting point that slows down a whole process
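The sentence quoted for "slope" in the table above describes fitting a simple linear regression of a metric across models and flagging slices that trend downward. Below is a rough sketch of that idea in plain Python. It is an assumption-heavy illustration: the metric history is invented, model versions are simply numbered 0, 1, 2, ..., and a fixed threshold stands in for the significance check, whose details are not given in these notes.

```python
def ols_slope(values):
    """Least-squares slope of values against their index 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two points to fit a line")
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    numerator = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(values))
    denominator = sum((i - x_mean) ** 2 for i in range(n))
    return numerator / denominator


# Hypothetical accuracy per slice, one value per successive model version.
HISTORY = {
    "all": [0.80, 0.82, 0.85, 0.86],
    "short text": [0.78, 0.74, 0.69, 0.65],
}

# Assumed cutoff: flag a slice when accuracy drops by more than
# one point per model version on average.
THRESHOLD = -0.01

regressing = [
    name for name, values in HISTORY.items() if ols_slope(values) < THRESHOLD
]
print(regressing)
```

In this invented history the overall metric improves with every model version while the "short text" slice steadily gets worse, so only that slice is flagged. This is the situation the per-slice trend indicator is meant to surface.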

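3. Useful Expressions

I have not read many papers yet, but academic writing feels a bit like the essay section of the gaokao (China's college entrance exam): much of it follows reusable patterns.
This section also collects some more polished expressions.

Expression	What to borrow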
Despite a growing focus on the importance of behavioral evaluation, it remains a challenging task in practice.	"despite a growing focus on"; "remains a challenging task in practice"
Given the current state of behavioral evaluation for machine learning, this paper asks two guiding research questions	"Given the current state of": first describe the current state of X, then state the research questions
To this end, we make the following contributions:	Many papers use this sentence pattern to introduce their contributions; worth borrowing
We first explore the current state of machine learning evaluation, including common techniques and approaches. We then describe existing tools for evaluation, and conclude with methods for improving collaboration and shared model understanding in data science and ML.	This sentence introduces the related work; the pattern "we first explore ... we then describe ... we conclude with ..." is also worth borrowing
Inspired by	Useful for introducing someone else's work that motivated yours
In this work we further explore evaluation in practice through our formative study	Useful for introducing your own work; "further explore" comes up often
dedicate more time to	"dedicate" is a good word for saying that you spend time on something
While we validated that most of the design goals were met by Zeno, our case studies did not thoroughly explore how Zeno could be used over longer periods (D3). All four participants worked with early-stage models and only used Zeno for a limited time. Longer-term, in-situ studies would provide more nuanced feedback for the utility of Zeno’s model comparison features. A benefit of Zeno’s ease of use, both with the API and UI, is that users can immediately start using Zeno’s model tracking and comparison features as models move from research to deployment.	From the paper's discussion section. Phrases worth borrowing: "only used Zeno for a limited time", "longer-term, in-situ studies", "more nuanced feedback", "Zeno’s ease of use, both with the API and UI", "immediately start using Zeno’s model tracking and comparison features as models move from research to deployment"
Zeno provides a general and extensible framework for the behavioral evaluation of ML, but leaves significant room to better address the challenges in the evaluation process.	"leaves significant room to better address the challenges in the evaluation process" is worth borrowing; it is a sentence for describing future work
We would like to thank ...	Used in the acknowledgements

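III. Workflow Summary

Where I found the paper: arxiv.org

arXiv is an open-access online preprint repository that gives researchers a fast way to share and disseminate their results, which supports scholarly communication and collaboration.

Reading software: ReadPaper

It really is great to use!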

Checking where the paper was published
- dblp
  
  This paper is indexed under CHI 2023
Getting past the language barrier:
- ReadPaper's built-in translation feature
- NetEase Youdao AIbox
- ChatGPT
- OpenAI Translator plugin
AI-assisted paper reading
- AMiner
- perplexity.ai
  
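  Perplexity now has an academic-focused search mode, which is very helpful for looking up information about a paper, academic terminology, and the like!
- Got access to GPT-4
  - The AskYourPDF plugin lets you chat with a paper's PDF
  - The ScholarAI plugin can translate a paper in full
  - Some impressive results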