William W. Cohen

Visiting Professor

Prospective visitors/students: see announcements


William Cohen is a Visiting Professor at Carnegie Mellon University in the Machine Learning Department. He also holds a position as a Principal Scientist at Google, where he worked full-time between May 2018 and March 2024. He received his bachelor's degree in Computer Science from Duke University in 1984, and a PhD in Computer Science from Rutgers University in 1990. From 1990 to 2000 Dr. Cohen worked at AT&T Bell Labs and later AT&T Labs-Research, and from April 2000 to May 2002 he worked at Whizbang Labs, a company specializing in extracting information from the web. From 2002 to 2018, he worked at Carnegie Mellon University in the Machine Learning Department, with a joint appointment in the Language Technologies Institute.

Dr. Cohen is a past president of the International Machine Learning Society. In the past he has also served as an action editor for the AI and Machine Learning series of books published by Morgan & Claypool, for the journal Machine Learning, the journal Artificial Intelligence, the Journal of Machine Learning Research, and the Journal of Artificial Intelligence Research. He was General Chair for the 2008 International Machine Learning Conference, held July 6-9 at the University of Helsinki, in Finland; Program Co-Chair of the 2006 International Machine Learning Conference; and Co-Chair of the 1994 International Machine Learning Conference. Dr. Cohen was also the co-Chair for the 3rd Int'l AAAI Conference on Weblogs and Social Media, which was held May 17-20, 2009 in San Jose, and was the co-Program Chair for the 4th Int'l AAAI Conference on Weblogs and Social Media. He is a AAAI Fellow, and was a winner of the 2008 SIGMOD "Test of Time" Award for the most influential SIGMOD paper of 1998, the 2014 SIGIR "Test of Time" Award for the most influential SIGIR paper of 2002-2004, and the 2023 Semantic Web Science Association's Ten-Year Award for the most influential paper of the ISWC-2013 conference.

Dr. Cohen's research interests include question answering, machine learning for NLP tasks, and neuro-symbolic reasoning, and he has a long-standing interest in statistical relational learning. He holds seven patents related to learning, discovery, information retrieval, and data integration, and is the author of more than 300 publications.

Announcements and FAQs


For now, my old course notes and lectures are available through CMU.

Software and demos

  • Enron email dataset (400Mb, once you get there) contains 800,000+ emails from 150+ users organized into 4700+ folders.
  • classify.tar.gz (0.4Mb) contains nine problems in which the goal is to classify short entity names. This data was used in Joins that Generalize: Text Classification Using WHIRL (KDD-98).
  • match.tar.gz (0.7Mb) contains a suite of labeled entity-name matching and clustering problems (i.e., problems for which the correct matches/clusters are provided), in a single consistent format. In most cases WHIRL's performance is given as a benchmark. These are also distributed in the RIDDLE Repository. Extraction-oriented versions of some of this data (i.e., represented as a problem of extracting data from a website, rather than matching two datasets) are available on the RISE Repository.
  • whirl-bench.tgz (1.1Mb) contains some more WHIRL-format entity name matching problems.

Talks and presentations


I keep recent papers in HTML or PDF (which requires Adobe Acrobat Reader to view). Older papers are mostly in PostScript. For Windows, I use the GSView reader for PostScript. Most of these papers are viewable in several formats in ResearchIndex.

Students and other colleagues

Current students:

Former students/colleagues: