Submissions/CLEF 2011 - Semi-automated Artificial Intelligence to assist editing: An opportunity for Wikimedia sites

This is a rejected submission for Wikimania 2012.

Submission no.
  • 606

Title of the submission
  • CLEF 2011 - Semi-automated Artificial Intelligence to assist editing: An opportunity for Wikimedia sites
Type of submission (workshop, tutorial, panel, presentation)
  • Presentation
Author of the submission
  • とある白い猫
E-mail address
  • to.aru.shiroi.neko@gmail.com
Username
  • とある白い猫
Country of origin
  • Residing in Brussels
Affiliation, if any (organization, company etc.)
  • Wikimedia websites (Wikipedia, Commons, etc.)
Personal homepage or blog
  • none
Abstract (please use no less than 300 words to describe your proposal)
Track (Wikis and the Public Sector/GLAM - Galleries, Libraries, Archives, and Museums/WikiCulture and Community/Research, Analysis, and Education/Technology and Infrastructure)
  • Technology and Infrastructure
Length of presentation/talk (if other than 25 minutes, specify how long)
  • 25 Minutes
Will you attend Wikimania if your submission is not accepted?
  • Yes
Slides or further information (optional)
  • Slides: DRAFT (3.39 MB)


Special request as to time of presentations (for example - can not present on Saturday)
  • None

Abstract

Artificial Intelligence

Breakdown of content on Wikipedia main namespace[1]

Language     All pages    Content pages   Deleted pages   All/Del ratio   Cnt/Del ratio
English      9,272,208    3,933,153       2,387,906       0.7952          0.6222
German       2,337,921    1,383,695       1,014,441       0.6974          0.5770
French       2,410,253    1,226,669       517,845         0.8231          0.7032
Dutch        1,474,132    1,032,487       217,583         0.8714          0.8259
Italian      1,350,753    909,979         428,606         0.7591          0.6798
Spanish      2,190,060    878,116         752,814         0.7442          0.5384
Polish       1,145,943    885,712         327,161         0.7779          0.7303
Russian      1,733,689    835,022         373,430         0.8228          0.6910
Japanese     1,279,097    803,157         205,484         0.8616          0.7963
Portuguese   1,283,345    717,771         419,129         0.7538          0.6313

(Each ratio gives the share of pages that survive deletion, i.e. kept / (kept + deleted).)
Breakdown of content on Wikimedia Commons

Filetype        Files in use[2]   Files deleted[3]   Content
midi            2,125             353                audio
wav             4                 39                 audio
ogg             159,522           5,945              audio/video
mp4             1                 80                 audio/video
gif             126,978           73,710             image/animation
jpeg            10,396,055        1,154,564          image
png             896,414           211,636            image
svg+xml         522,940           31,084             image, vector
tiff            83,546            2,235              image
vnd.djvu        20,680            1,342              image
x-xcf           271               118                image
vnd.ms-office   1                 1                  ?
x-c             1                 0                  ?
pdf             20,781            9,691              mixed text & images

Wikimedia Commons

  • There are 105,396 galleries on Commons
  • There are 12,395,328 files on Commons
  • There are 1,539,091 deleted files on Commons
    • ~11.05% of all files ever uploaded have been deleted (1,539,091 / (12,395,328 + 1,539,091) ≈ 0.1105)

Statistics by DaB.

Artificial Intelligence (AI) is a branch of computer science in which machines process information to find patterns in data and use those patterns to predict how future data should be handled. Its use has grown considerably, particularly in the past decade, with applications ranging from search engines to space exploration.

Since their creation, Wikipedia and the other Wikimedia projects have relied on volunteers to handle all tasks through crowdsourcing, including the mundane ones. With the exponential increase in the amount of data and with improvements in Artificial Intelligence, we are now able to delegate mundane tasks to machines to a certain degree. Wikimedians are currently dealing with an overwhelming amount of content; the tables to the right give a sense of just how much.

A key problem with Artificial Intelligence research is that researchers are often not experienced Wikimedians, so they do not realize the potential of tools that Wikimedians know and take for granted. For example, few people outside the circles of experienced Wikimedians know that images deleted on Wikimedia projects are not really deleted, merely hidden from public view. One researcher I talked to called the deleted image archive of Commons a "gold mine". Indeed, for any machine learning task, labeled content (which on Commons can be read as "wanted" versus "unwanted" content) enables supervised learning. A system could use deleted content, deletion summaries, and the text of deleted image description pages to determine whether similar unwanted content still exists on the project, or whether new uploads resemble previously deleted material. This is just one of many examples of how artificial intelligence can assist editing.
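
As an illustration of this supervised-learning idea, here is a minimal sketch in Python. It assumes the description-page texts of kept and deleted files have already been exported to two plain-text files; the file names, the feature choices and the whole pipeline are hypothetical illustrations, not an existing Wikimedia tool.

```python
# Minimal sketch: learn a "likely unwanted" classifier from deletion history.
# Assumes two hypothetical text files, one description-page text per line:
#   kept_descriptions.txt     - description pages of files still in use
#   deleted_descriptions.txt  - description pages of deleted files
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

def load_lines(path):
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

kept = load_lines("kept_descriptions.txt")        # label 0: wanted
deleted = load_lines("deleted_descriptions.txt")  # label 1: unwanted

texts = kept + deleted
labels = [0] * len(kept) + [1] * len(deleted)

# Bag-of-words features over the description-page wikitext.
vectorizer = TfidfVectorizer(max_features=50_000, ngram_range=(1, 2))
X = vectorizer.fit_transform(texts)

X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=0, stratify=labels)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))

# A new upload's description page can then be scored for human review.
new_page = "own work | my holiday photo, all rights reserved"
score = model.predict_proba(vectorizer.transform([new_page]))[0, 1]
print(f"probability the upload resembles deleted content: {score:.2f}")
```

Deletion summaries could be folded in as additional features, so the model also learns which deletion rationales (copyright violation, out of scope, vandalism) a new upload most resembles.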

To expand on the idea: tools such as Copyscape and TinEye are not customized to serve Wikimedia projects specifically. Being general-purpose limits their accuracy, which in turn limits how far they can satisfy the needs of Wikimedia projects. Innovative use of AI methods such as information retrieval, text mining and image retrieval can lead to more advanced tools.
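
On the image-retrieval side, a Wikimedia-specific tool could start with something as simple as perceptual hashing to flag re-uploads of previously deleted images. The sketch below uses the Pillow and imagehash Python packages; the hash archive and all file paths are hypothetical placeholders, not an existing service.

```python
# Minimal sketch: flag uploads that look like previously deleted images
# using perceptual hashing. Requires the Pillow and imagehash packages.
from PIL import Image
import imagehash

# In practice this set would be built once from the deleted-image archive
# and kept up to date; these example paths are placeholders.
deleted_hashes = {
    imagehash.phash(Image.open(path))
    for path in ["deleted/example1.jpg", "deleted/example2.png"]
}

def looks_deleted(upload_path, max_distance=8):
    """Return True if the upload is perceptually close to a deleted image.

    phash is robust to rescaling and mild recompression, so a small
    Hamming distance usually indicates the same underlying picture.
    """
    h = imagehash.phash(Image.open(upload_path))
    return any(h - known <= max_distance for known in deleted_hashes)

if looks_deleted("uploads/new_file.jpg"):
    print("Flag for human review: resembles previously deleted content.")
```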

CLEF 2011

Report on CLEF 2011: Participation:Presenting at PAN Lab of CLEF 2011/Report

The CLEF (Cross-Language Evaluation Forum) conference has various tracks on Artificial Intelligence for text, image and even audio mining. The conference is divided into presentations and workshops. Each workshop track has sub-tasks that branch into more specialized fields, in which competing implementations are ranked. The diagram to the right shows the structure of one of the many workshops as an example.

CLEF 2011 drew 174 registered participants and 52 students, 226 people in total, from 29 countries on 5 continents. Although CLEF is known as a primarily European conference, its international makeup draws on scientists worldwide. Unlike its more business-oriented counterparts, CLEF is research-driven, which makes its goals compatible with those of non-profit projects and organizations.

Diagram: Structure of PAN

I attended CLEF 2011 as a participant, funded by a grant from Wikimedia Deutschland. Aside from presenting my own research, I spent the remainder of my time analyzing the potential the conference holds for Wikimedia projects, Wikipedia and Commons in particular. Admittedly, I was quite surprised that a significant majority of researchers, as well as keynote speakers, stated that they had used Wikimedia projects as a source of raw data at some point, if not for their current research topic. Such research can generate innovative new tools that handle mundane tasks automatically or semi-automatically, leaving human editors more time for other work.

I believe that with little effort CLEF could become an indispensable asset for Wikimedia Foundation projects, since researchers participating in CLEF already use Wikimedia projects. The PAN and ImageCLEF labs in particular could help with issues the wikis face, such as automated identification of copyrighted material (text and images), automated tagging of images (for example, for the image filter already approved by the Board of Trustees and the community through the referendum), and semi-automated categorization of images. This would leave human editors more time for more creative tasks. It is also worth noting that the Foundation had practically no presence at CLEF 2011, even though Foundation-run projects dominated discussions in practically every track.

Some Artificial Intelligence ideas for the presentation

  • Wikipedia
    • Copyright/Plagiarism Detection: Semi-automatic identification of copyrighted content copied from external sources
      • A large proportion of copyright violations are automatically blanked and tagged by EN:User:CorenSearchBot on the English language Wikipedia.
    • Author Identification: Semi-automatic identification of returning banned users as well as meatpuppets
    • Vandalism Detection: Semi-automatic identification of vandalism (a minimal feature sketch follows after this list)
      • A large majority of vandalism on the English language Wikipedia is automatically screened out by the edit filters or reverted by EN:User:ClueBot_NG.
    • Disambiguation: Semi-automatic identification of disambiguation links so they can be pointed to the proper page
    • Category Identification: Semi-automatic categorization of articles
    • Correlate real-life events: Semi-automatic identification of content for current events
  • Wikisource
    • OCR for wiki: OCR developed to assist importing scanned content to Wikisource
  • Wikimedia Commons
    • Unwanted: Semi-automatic identification of unwanted content (copyright violations, vandalism/trolling oriented uploads, non-project scope uploads)
    • Controversial: Semi-automatic identification of controversial content (nudity, violence)
    • Categorization: Semi-automatic categorization of images
    • Plant identification: Semi-automatic identification of plant features to assist in species identification
  • Wikimedia servers
    • Performance: Performance analysis to predict how well each server is doing, anticipate server problems before they become critical, and identify their causes
    • Cyber Defence: Methods such as anomaly detection to identify intrusion activity on the servers
  • Wikimedia Foundation
    • Sentiment analysis of social media and the web: Data mining to identify sentiment towards the Foundation itself and towards its decisions
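
To make the vandalism-detection item above concrete, here is a minimal Python sketch of rule-based scoring of a single edit. All feature names, weights and thresholds are hypothetical illustrations; production tools such as ClueBot NG use machine-learned models over far richer feature sets.

```python
# Minimal sketch: score an edit for vandalism from simple revision features.
# Purely illustrative; real systems learn their weights from labeled edits.
import re

def vandalism_features(old_text, new_text, comment, is_anonymous):
    """Extract a few classic vandalism signals from a single edit."""
    added = new_text[len(old_text):] if new_text.startswith(old_text) else new_text
    return {
        "blanked_most_content": len(new_text) < 0.2 * len(old_text),
        "shouting": bool(re.search(r"[A-Z]{10,}", added)),
        "repeated_chars": bool(re.search(r"(.)\1{6,}", added)),
        "no_edit_summary": not comment.strip(),
        "anonymous_editor": is_anonymous,
    }

def vandalism_score(features, weights=None):
    # Hypothetical hand-picked weights; a trained model would replace these.
    weights = weights or {
        "blanked_most_content": 0.5,
        "shouting": 0.2,
        "repeated_chars": 0.2,
        "no_edit_summary": 0.05,
        "anonymous_editor": 0.05,
    }
    return sum(w for name, w in weights.items() if features[name])

feats = vandalism_features(
    old_text="The Battle of Hastings took place in 1066.",
    new_text="AAAAAAAAAAAA LOL",
    comment="",
    is_anonymous=True,
)
print(vandalism_score(feats))  # edits above some threshold go to human review
```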

Interested attendees

If you are interested in attending this session, please sign with your username below. This will help reviewers to decide which sessions are of high interest. Sign with four tildes. (~~~~).

  1. NaBUru38 15:39, 5 February 2012 (UTC)
  2. Bináris 10:32, 13 February 2012 (UTC)
  3. Looks interesting. CT Cooper · talk 20:36, 14 February 2012 (UTC)
  4. Houshuang (talk) 00:57, 11 March 2012 (UTC)
  5. Daniel Mietchen - WiR/OS (talk) 22:43, 18 March 2012 (UTC)
  6. Zellfaze (talk) 15:13, 19 March 2012 (UTC)
  7. Psychology (talk) 13:22, 3 April 2012 (UTC)
  8. Thuvack (talk) 17:45, 21 April 2012 (UTC)
  9. Add your username here.