ICML 2003 Workshop
The Continuum from Labeled to Unlabeled Data in Machine Learning
and Data Mining
The 20th International Conference on Machine Learning (ICML 2003) will be held in
Washington, DC, August 21-24, 2003. It will be co-located with the ACM SIGKDD
International Conference on Knowledge Discovery & Data Mining (KDD 2003) and the
Conference On Learning Theory (COLT 03).
Important Dates
- Papers Due: May 5
- Notification: May 25
- Final Version Due: June 26
- Workshop Date: August 21
Organizers:
- Rayid Ghani, Accenture Technology Labs
- Rosie Jones, Overture Services
- Chuck Rosenberg, Carnegie Mellon University
Workshop Description
There is a spectrum of
ways to use data in machine learning and data mining. At one end is
completely unsupervised learning or clustering, and at the other end is
supervised learning where the target output is known for every example. This
workshop aims to explore the space between these two extremes. Techniques that
have been proposed include learning from unlabeled data with hints, learning
from unlabeled and positive-only labeled data, learning from distantly and
noisily labeled data, combining labeled and unlabeled data with co-training, EM,
and other semi-supervised techniques, and transductive learning, where the test
data is added as an additional source of unlabeled data. The possible sources
of labels and hints are also broad, including systematic hand-labeling, labels
acquired through active learning, and hints derived from domain knowledge.
The goal of this workshop is to bring together researchers from different
fields to talk about their different perspectives on this intersection and to
share their latest ideas. We see the workshop not only as a venue for the
presentation of papers focusing on exploiting unlabeled data, but also as a forum
for sharing ideas across different application domains. In particular, it is an
opportunity for discussion of techniques that are applicable to multiple types
of datasets, and experiments across many points in the continuum from
unsupervised to supervised learning. The use of domain knowledge as a source of
partial supervision, and the generation of examples to be labeled by domain
experts through active learning are of particular significance in the data
mining context. We are also interested in promoting discussion to develop
diagnostic techniques that can inform the user whether unlabeled data is
helping or hurting the performance of the underlying learner.
We see this as a unique opportunity due to the co-location of ICML with KDD. With
this workshop co-located with KDD, we will target researchers from both
academia and industry who are involved in data mining to participate in the
workshop. For many data mining problems, large amounts of data have been
collected and the labels are either not known or are expensive to obtain.
Examples include security applications (intrusion detection, anomaly
detection), CRM (customer interactions, transactional data, call center
applications), the financial industry (fraud detection, loan defaults, banking),
targeted marketing, and retail applications (supply chain optimization). Most of
these applications have large amounts of unlabeled data being captured but
rarely utilized. We encourage the participation of people working on practical
applications where some form of unlabeled data can be beneficial.
Workshop Format
The workshop will consist of both regular paper presentations and debates.
Regular Papers
Papers addressing novel types of
data, methods of diagnosing when unlabeled data will help and when it will
hinder, and the application of techniques across multiple application domains and
multiple levels of supervision are particularly encouraged. Papers discussing
the acquisition of labels from real-world experts in real-world data mining
problems are also encouraged. Data mining practitioners working on real-world
problems with large amounts of captured/stored data but a high-cost labeling
process are encouraged to submit problem descriptions and possible solutions.
Regular papers can be up to eight pages, and may address work in
progress. Papers should be in the format required for ICML submissions.
Problem Descriptions from Machine Learning/Data Mining Practitioners
We solicit papers, one to two pages in length, describing a
problem domain you have encountered or dealt with where training data and/or
labels are very expensive or hard to obtain. The paper should present a problem
statement, give background on the domain, and list sources and amount of
available training data. We hope these types of papers will encourage
participation from people working on practical applications where unlabeled
data can potentially be valuable but is not currently utilized. We hope to
devote a session in the workshop to discuss these problems and brainstorm
possible solutions and ways to use unlabeled data for the problems posed in
these papers.
Debate Position Papers
Position papers, one to
two pages in length, on either side of the following topics are solicited.
Accepted papers will be published in the workshop proceedings, and
authors will be expected to debate their position. Topics not on this list are
also acceptable, if you can coherently argue both sides, or can encourage a
colleague to submit the opposing position.
- Unlabeled data is only useful when there are a large number
of redundant features.
- Why doesn't the No Free Lunch Theorem apply when working with
unlabeled data?
- Unlabeled data has to come from the same underlying
distribution as the labeled data.
- Can unlabeled data be used in temporal domains?
- Feature engineering is more important than algorithm design
for semi-supervised learning.
- All the interesting problems in semi-supervised learning have
been identified.
- Active learning is an interesting "academic"
problem.
- Active learning research without user interface design is
only solving half the problem.
- Using unlabeled data in data mining is no different from
using it in machine learning.
- Massive data sets pose problems when using current
semi-supervised algorithms.
- Off-the-shelf data mining software incorporating labeled and
unlabeled data is a fantasy.
- Unlabeled data is only useful when the classes are well
separated.
Schedule
To be decided later
Organizers
- Rayid Ghani, Accenture Technology Labs, 161 N. Clark St, Chicago, IL 60601, +1 (312) 693-6653
- Rosie Jones, Overture Services, 74 N. Pasadena Ave 3F, Pasadena, CA 91107, rosie.jones@overture.com, +1 (626) 229-8536
- Chuck Rosenberg, Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA 15213, chuck@cs.cmu.edu, +1 (412) 268-8078
Program Committee
- Kristin Bennett, Rensselaer Polytechnic Institute
- Mark Craven, University of Wisconsin
- Zoubin Ghahramani, Gatsby Computational Neuroscience Unit, UCL
- Sally Goldman, Washington University, St. Louis
- Tony Jebara, Columbia University
- Thorsten Joachims, Cornell University
- Stefan Kremer, University of Guelph
- Bing Liu, National University of Singapore
- Andrew McCallum, University of Massachusetts
- Ray Mooney, University of Texas, Austin
- Ion Muslea, University of California, Irvine
- Kamal Nigam, IntelliSeek
- Ellen Riloff, University of Utah
- Dale Schuurmans, University of Waterloo
- Martin Szummer, Microsoft Research, Cambridge
- Sarah Zelikovitz, City University of New York
- Tong Zhang, IBM Research, Yorktown Heights