Report: Only 20% of Wikipedia biographies are about women


Wikipedia, which ranks as one of the 10 most visited websites in the world, is the number one destination for many people looking for information on historical figures and change-makers. But not everyone is represented equally on Wikipedia.

According to the Wikimedia Foundation, only about 20% of the biographies on the site’s English-language version are about women, “and we think the percentage is even lower for women in intersectional groups, such as women in science, women in Africa, and women in Asia,” said Angela Fan, a research scientist at Meta.

“For my PhD project as a computer science student at the University of Lorraine and Inria in France, I worked with Claire Gardent to develop a new way to address this inequity using artificial intelligence,” she added.

“We have created an AI system that can research and write rough drafts of Wikipedia-style biographical entries for important people who are not currently on the site.”

“The problem is personal to me, rooted in the lack of representation I saw in libraries when I was in elementary school,” she said. “In third grade I was assigned to write an essay about a historical figure, and the only requirement was that the library have a book about that figure.” She added, “I wanted to write about Eleanor Roosevelt, but I had to settle for writing about Teddy Roosevelt.”

“What if I had wanted to write about someone who looked like me? Would a book even have been available? If we imagine the same assignment today, students would undoubtedly turn to the internet, most likely Wikipedia. Wikipedia has millions of articles written in English, including a great one on Eleanor Roosevelt. But we know there are still many women whose stories and achievements have not reached future generations.” While women are more likely than men to write biographies about other women, Wikimedia’s 2021 Community Insights Report, which covers the previous year, found that only 15 percent of Wikipedia editors are women.

This leads to the neglect and marginalization of women, despite the enormous influence they have had throughout history in science, entrepreneurship, politics, and every other area of society.

Canadian physicist Donna Strickland won the Nobel Prize in Physics in 2018, yet anyone searching Wikipedia for information about her would have found nothing until a biography covering her groundbreaking work was finally published, days after she won the highest award in her field of study.

Various studies, including some from the Wikimedia Foundation itself, have documented gender inequality on the platform. Even with this underrepresentation, biographies of women are still disproportionately nominated for deletion.

One study found that in 2017, 41% of the biographies nominated for deletion were about women. “We believe that open, reproducible science can provide a starting point for addressing this problem,” she said. “Today we are releasing a comprehensive, open-source artificial intelligence model that automatically generates high-quality biographical articles about important public figures around the world. Our model searches websites for relevant information and composes a Wikipedia-style entry for the person, complete with citations to its sources.

Along with the release of the model, we are rolling out a new dataset for evaluating the model’s performance on biographies of 1,527 women from marginalized groups.”

“This dataset can be used to train models, evaluate performance, and refine the model. We believe this AI-generated text can serve as a starting point for people who write content on Wikipedia, and for fact-checkers, to publish more biographies of people from marginalized groups on the site,” she said. She emphasized that there is still much more to be done to broaden representation on Wikipedia for outstanding people of all backgrounds: AI systems like the one her team created will have to confront societal and technical challenges at scale in order to fully address the problem.

That begins with the content of the websites used to build Wikipedia entries, which may be flawed or reflect cultural biases. On the technical side, the text-generation system is prone to “hallucinating” content that is not real.

Even the best language models today struggle to produce coherent text across many paragraphs. “We hope to improve this through progress on the neural architectures that support such models and through concrete advances in the responsible development of artificial intelligence, so that this approach can help non-experts create accurate articles to add to the internet’s store of information, with only minimal editing required,” she said.

How AI can supplement existing efforts to address bias

While our model is not a panacea, it is an important step toward supporting and complementing other ongoing efforts to address gender representation on Wikipedia.

Volunteer editors Jessica Wade and Penny Richards have independently worked to write and publish thousands of Wikipedia biographies of women who deserve recognition. According to the report, another large collective effort is Wikipedia’s “Women in Red” project, which enlists editors to create new biographies and expand existing ones about notable women past and present. We decided to take a complementary approach: researching, building a bibliography, and writing is labor-intensive, yet a body of information already exists online that can be used to share the stories of women whose achievements, voices, and legacies have been forgotten or marginalized. For example, we used the model to create a short biography of Libbie Hyman, a pioneer in the study of invertebrate zoology.

In the generated draft, green text is pulled from the reference article we started with, purple text comes from retrieved web documents, and orange text is hallucinated, meaning the model produced information that cannot be verified. The model retrieved relevant biographical information about Hyman, including her focus on invertebrates, her important publications, and the impact of her work, which editors could then use as a starting point for fact-checking (an area where the model still falls short) and for expanding on her life and achievements.

Using pretraining and retrieval to create biographies on Wikipedia

“We start the biography-writing process with a retrieval-augmented generation architecture built on extensive pretraining, which teaches the model how to select only relevant information, such as a person’s place of birth or where they attended school, while writing the biography,” says Angela Fan in her report.

The model first retrieves relevant information from the internet about the subject. The generation module then drafts the text, and the third step, the citation module, builds the bibliography and links the text back to the sources that were used. The process is then repeated, with each section predicting the next, to assemble all the elements of a strong Wikipedia biography, including the subject’s early life, education, and career. We generate section by section, using a caching mechanism similar to Transformer-XL to refer back to previously written sections, which gives the model greater document-wide context.
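The three-step loop described here can be sketched as follows. This is a minimal illustration, not Meta's actual implementation: the function names (`web_search`, `generate_section`, `attribute_citations`) and the list-based cache are placeholders standing in for the real retrieval, generation, and citation modules.

```python
def write_biography(subject, section_titles,
                    web_search, generate_section, attribute_citations):
    """Draft a Wikipedia-style biography one section at a time.

    The three callables stand in for the pipeline's modules:
    retrieval, text generation, and citation attribution.
    """
    sections = []  # (title, text, citations) tuples
    cache = []     # previously written sections, reused as context

    for title in section_titles:
        # Step 1: retrieval -- pull candidate evidence from the web.
        evidence = web_search(f"{subject} {title}")

        # Step 2: generation -- draft the section, conditioning on the
        # cached earlier sections (a Transformer-XL-style memory).
        text = generate_section(title, evidence, context=cache)

        # Step 3: citation -- link the drafted text back to its sources.
        citations = attribute_citations(text, evidence)

        sections.append((title, text, citations))
        cache.append(text)  # make this section visible to later ones

    return sections
```

The cache is what gives later sections document-wide context: each newly drafted section is appended to it before the next one is generated.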

Caching is important because it lets the model keep better track of what it has already generated. Automatic and human evaluations show that the model is able to find relevant information and use it to write biographies, but there is still work to be done. These evaluations found that 68% of the text in the biographies we generated was not found in the reference text.
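The article does not specify how the 68% overlap figure was computed, but the underlying idea, measuring what fraction of generated text cannot be found in a reference, can be illustrated with a simple n-gram check. This is an assumed, simplified metric for illustration only, not the evaluation Meta actually ran.

```python
def novel_fraction(generated, reference, n=3):
    """Fraction of n-grams in `generated` that never appear in `reference`.

    A value near 1.0 means the generated text shares almost no phrasing
    with the reference; near 0.0 means it is mostly copied.
    """
    def ngrams(text):
        toks = text.lower().split()
        return [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]

    gen_grams = ngrams(generated)
    if not gen_grams:
        return 0.0  # nothing generated, nothing novel

    ref_grams = set(ngrams(reference))
    novel = sum(1 for g in gen_grams if g not in ref_grams)
    return novel / len(gen_grams)
```

A high novel fraction cuts both ways, exactly as the article notes: it shows the model is synthesizing rather than copying, but every novel phrase is also a phrase that must be independently fact-checked.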

This tells us several things, including that the model does a good job of finding and synthesizing relevant information rather than simply copying the reference text. But it also complicates verification, because it is difficult to know which of that information is accurate and which is not. We asked evaluators to judge whether complete sentences were accurate, and found many cases where sentences could only be partially verified.

These challenges are common to text generation at large, but they are exacerbated for marginalized groups, about whom there is very little data. We hope that releasing this dataset will allow other researchers to study the problem.

Obstacles during the research

First, the lack of training data, that is, of already-existing biographical articles about women, was very difficult to overcome. Articles about women, especially women from marginalized groups, are much shorter and less detailed than the average article about a man, and they use different language, for example, “female scientist” rather than simply “scientist.”

This bias in the training data caused the models to replicate it. In addition, Wikipedia articles must be written from factual evidence, usually found on the internet. But Wikipedia’s bias mirrors the internet’s: far fewer web sources exist that can be cited as evidence for articles about women. While these underlying problems cannot be solved quickly, this is precisely the kind of problem where technology can be used to help bring about positive change.

What is the next step? Highlighting more marginalized people on Wikipedia

We’re excited to share this work with the community to help foster discussion, experimentation, and progress toward more equitable content on Wikipedia. Our model addresses only one part of a multifaceted problem, so there are additional areas where new techniques must be explored.

When a Wikipedia editor, or our AI model, writes a biography, information is pulled from various online sources and cited. But despite all the rich knowledge the internet provides, some sources carry biases that must be taken into account. For example, when women are represented at all, their biographies are more likely to include extra details about their personal lives. A 2015 study found that the word “divorced” appears in women’s biographies four times more often than in men’s.

This may happen for several reasons, including tabloids that tend to follow the lives of notable women more closely than those of men.

As a result, articles about women are more likely to dwell on personal details, distracting from the achievements that should be in the spotlight and celebrated.

Technology has already shown promise in helping to address multiple forms of inequality, evidence that there is more society can do to make a difference.

For example, the site’s former CEO explained how an algorithm uncovered a critical gap on the site: while Wikipedia’s health articles are vetted by medical editors, for years some articles on health issues critical to women, such as breastfeeding, were categorized as “of little importance.”

There is more work to be done for marginalized groups along many other dimensions, around the world and across languages. Our evaluation and dataset focus on women, which leaves out many other groups, including nonbinary people. According to a 2021 study of social biases in Wikipedia articles, articles about transgender and nonbinary people tend to be longer, but much of the extra space is devoted to their personal lives rather than to expanding on their accomplishments. It is important to recognize that bias exists in many forms, particularly in the online sources these articles draw on.

We are excited to share this as an important area of research with the broader community. We hope that our technology will eventually serve as a starting point for human writers on Wikipedia, and that it will lead to more equitable information on the internet, accessible to students writing biographies and beyond.

