About Columbia Newsblaster


Columbia Newsblaster is a system to automatically track the day's news. There are no human editors involved -- everything you see on the main page is generated automatically, drawing on the sources listed on the left side of the screen.

Every night, the system crawls a series of Web sites, downloads articles, groups them together into "clusters" about the same topic, and summarizes each cluster. The end result is a Web page that gives you a sense of what the major stories of the day are, so you don't have to visit the pages of dozens of publications.
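
The grouping step described above can be illustrated with a small sketch. This is not Newsblaster's actual clustering method, which this page does not describe. It only assumes that articles sharing many words are about the same topic. The function names, the Jaccard word-overlap measure, and the 0.2 threshold are all hypothetical choices made for illustration.

```python
import re

def tokens(text):
    """Return the set of lowercase words in a text (a deliberately crude representation)."""
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(a, b):
    """Word-set overlap, from 0.0 (nothing shared) to 1.0 (identical sets)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def cluster_articles(articles, threshold=0.2):
    """Greedy single-pass grouping (an assumed stand-in, not Newsblaster's algorithm).

    Each article joins the first cluster whose first article is similar
    enough; otherwise it starts a new cluster.
    """
    clusters = []  # each cluster is a list of article strings
    for article in articles:
        words = tokens(article)
        for cluster in clusters:
            if jaccard(words, tokens(cluster[0])) >= threshold:
                cluster.append(article)
                break
        else:
            # No existing cluster matched, so this article seeds a new one.
            clusters.append([article])
    return clusters

# Made-up headlines standing in for downloaded articles.
articles = [
    "Senate passes budget bill after long debate",
    "Budget bill passes Senate following debate",
    "Storm floods coastal towns overnight",
]
clusters = cluster_articles(articles)
```

With these made-up headlines, the two budget stories share most of their words and land in one cluster, while the storm story starts its own.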

Newsblaster is an academic project from the Natural Language Processing group at Columbia University's Department of Computer Science. It is designed to demonstrate the group's technologies for multidocument summarization, clustering, and text categorization, among others. It is funded under DARPA TIDES and KDD and has been operational online since September 2001.

Current and future enhancements include international perspectives, multilingual capability, and tracking events across days.




Newsblaster FAQ

  • Does Newsblaster threaten to replace reporters?
    Absolutely not! Newsblaster collects, clusters, categorizes, and summarizes news, but it does not write news. It will always need human journalists for its raw content.

  • Can I obtain more technical information about Newsblaster?
    Please check out this page for a list of papers concerning Newsblaster and its components.

  • What additional publications exist discussing Newsblaster?
    For a list of articles from the press which discuss Newsblaster, please check out this page.

  • How is Newsblaster different from Google News?
    Google News does not do multidocument summarization; it simply uses the articles' leading sentences. In addition, Newsblaster produces multiple summaries for an event, each reflecting the media from a particular country. Future expansions such as tracking events across days are also in the works.

  • Why can't I search Newsblaster?
    You will soon be able to search Newsblaster's summaries, as well as the blasted articles themselves.

  • Why does Newsblaster sometimes skip a daily update?
    We sometimes deliberately cancel a day's run when we don't want the Web page to change. This happens when we are preparing to present the system at a demo or site visit. Network problems and code bugs can also come up.

  • Can I license the code for Newsblaster or make it run on my own data?
    We are currently discussing plans by which we may be able to either license Newsblaster code or run it ourselves on other people's data. It is not yet clear when we will be able to do this. If you are interested, please contact blaster@cs.columbia.edu.

  • Is the Newsblaster code free or open?
    Sorry, there are no plans to make Newsblaster open source.

  • How does Newsblaster make its summaries?
    Newsblaster uses two different summarizers. One carefully selects sentences from among the articles and rearranges them to produce a coherent summary. The other looks for common information conveyed across all the articles and then formulates new sentences expressing that information. After a summary is generated, it is revised for greater fluency.

  • What platforms and languages are used to run Newsblaster?
    Newsblaster currently exists as a collection of programs, scripts, and tools that run on both the Sun Solaris and Linux operating systems. We have written components in Java, Perl, C, and shell scripting languages. A typical run, including crawling the Web, downloading documents, clustering, categorizing, and summarizing, currently takes 4-12 hours, depending on which summarizers are used.

  • Can Newsblaster be updated more often than once a day?
    When the overnight run finishes soon enough, Newsblaster performs an "incremental" run in the afternoon. That's when you will see stories marked as "NEW."

  • Who's writing and maintaining Newsblaster? How long did development of Newsblaster take?
    Work on Newsblaster started in the fall of 2001 and is still ongoing. Many of the components used within Newsblaster were developed under previous projects dating back to 1996. Development of Newsblaster is more active than ever at this time; the current members are listed under "The Newsblaster Team."

  • Why are some news sites used and not others? Can you add my site?
    We tried to choose common and popular news sites from the Web, and we have occasionally added new sites as requested by users. If you have a site you would like to recommend to us, please contact blaster@cs.columbia.edu.

  • Can I talk to anyone involved with Newsblaster directly?
    Feel free to contact blaster@cs.columbia.edu, which is monitored by team members, with any suggestions or comments.

  • What other similar projects exist?
    Aside from Google News, the only similar project we are aware of is NewsInEssence, developed at the University of Michigan. It's available at www.newsinessence.com.

  • I noticed errors that Newsblaster made!
    It's very difficult to write software that accurately deals with natural language. Summaries will sometimes contain bad English and other mistakes. Clusters occasionally end up with unrelated articles.
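
The first kind of summarizer mentioned in the FAQ, one that selects existing sentences from the articles, can be sketched as follows. This is only an illustration under stated assumptions, not the selection method Newsblaster uses. It assumes a sentence is worth keeping when its words are frequent across the whole cluster. The function names, the naive sentence splitter, and the scoring rule are all hypothetical.

```python
import re
from collections import Counter

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def words(sentence):
    """Lowercase words of a sentence, with repeats kept."""
    return re.findall(r"[a-z]+", sentence.lower())

def summarize(cluster, max_sentences=2):
    """Return the sentences whose words are most common across the cluster.

    Assumed scoring rule: average cluster-wide frequency of a sentence's
    words. Selected sentences are returned in their original order.
    """
    sentences = [s for article in cluster for s in split_sentences(article)]
    freq = Counter(w for s in sentences for w in words(s))

    def score(sentence):
        ws = words(sentence)
        return sum(freq[w] for w in ws) / len(ws) if ws else 0.0

    # Rank sentence indices by score; ties keep their original order.
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:max_sentences])
    return [sentences[i] for i in chosen]

# Made-up two-article cluster for illustration.
cluster = [
    "The senate passed the budget bill. Lawmakers debated for hours.",
    "The budget bill passed the senate. A storm is expected tomorrow.",
]
summary = summarize(cluster)
```

Frequency scoring of this kind tends to pick near-duplicate sentences, as it does with the two budget sentences here. It also does none of the rearranging or revising for fluency that the FAQ describes.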






The Newsblaster Team

As of May 2003, the Newsblaster team consists of:

Principal Investigators
Kathleen McKeown
Judith Klavans
Vasileios Hatzivassiloglou

Professors
Luis Gravano

Postdocs
John Chen

Students
Regina Barzilay (graduated)
Wisam Dakka
David Evans
Ani Nenkova
Carl Sable (graduated)
Barry Schiffman

Programmers
David Elson
Sergey Sigelman
Michael Tanenblatt

