
Digital Scholarship - Online Tools & Resources

An Introduction to the HathiTrust Extracted Features Dataset

The HathiTrust Digital Library is a partnership of major research institutions and libraries that works to build a comprehensive digital archive of materials converted from the print collections of these institutions. It places particular emphasis on providing access for individuals with print disabilities. To expand opportunities for research, the library prepared a data export of pre-extracted features, initially for public domain volumes of the HathiTrust Digital Library, called the HTRC Extracted Features Dataset.

This dataset comprises page-level features for both public-domain and in-copyright volumes in the HathiTrust Digital Library. These features introduce users to new data and new options for viewing and organizing this data. Features are defined as informative characteristics of a text, and include header and footer identification, part-of-speech tagged token counts, and various line-level information. Over 2 trillion tokens (or words) are currently included in the collections of HathiTrust, which are made up of 13.6 million volumes.


Find out more:

https://wiki.htrc.illinois.edu/display/COM/Extracted+Features+Dataset

https://analytics.hathitrust.org/datasets

https://github.com/htrc/htrc-feature-reader

 

Because the information is provided per page, users can more easily distinguish the main text from paratext; advertisement text, for example, can be identified separately. The characters at the beginning and end of lines can also be used to differentiate text from paratext on a page.

Figure: A screenshot of the word "scholar" put through the HathiTrust Extracted Features Dataset.

Term counts are specific to a term's part of speech, so a word used as more than one part of speech is counted separately for each usage. For example, a term used as both a noun and a verb will have separate counts.
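To make the per-part-of-speech counting concrete, here is a minimal Python sketch. The `page_counts` dictionary is invented illustrative data, shaped loosely like a token-to-tag-to-count mapping for a single page; the tags follow the Penn Treebank convention (`NN` for noun, `VB` for verb), and the helper names are hypothetical rather than part of any HTRC library.

```python
# Illustrative data only: token -> {part-of-speech tag -> count} for one page.
page_counts = {
    "record": {"NN": 3, "VB": 2},  # used as both a noun and a verb
    "scholar": {"NN": 4},
    "the": {"DT": 11},
}


def counts_by_pos(counts, term):
    """Return a copy of the per-tag counts for a term (empty dict if absent)."""
    return dict(counts.get(term, {}))


def total_count(counts, term):
    """Sum a term's counts across all of its part-of-speech tags."""
    return sum(counts.get(term, {}).values())


print(counts_by_pos(page_counts, "record"))  # {'NN': 3, 'VB': 2}
print(total_count(page_counts, "record"))    # 5
```

Keeping the tags separate lets a researcher ask about "record" as a noun only, while the summed total is still available when the distinction does not matter.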

Additionally, line information (such as the number of lines with text on a page, and counts of the characters that begin and end the lines on every page) can help clarify the genre or volume structure of a certain work.
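As one possible illustration of how line-level information might hint at genre, the sketch below computes the share of lines on a page that begin with an uppercase letter. It rests on the assumption that verse tends to capitalize the start of each line while prose does not, which is a rough heuristic rather than a method prescribed by the dataset. The `pages` data and the field names `line_count` and `begin_chars` are invented for the example.

```python
# Illustrative data only: for each page, how many lines begin with each character.
pages = [
    {"line_count": 30, "begin_chars": {"t": 12, "a": 9, "T": 5, "w": 4}},  # prose-like
    {"line_count": 20, "begin_chars": {"T": 8, "A": 6, "O": 4, "w": 2}},   # verse-like
]


def capitalized_line_share(page):
    """Fraction of counted lines on a page that begin with an uppercase letter."""
    total = sum(page["begin_chars"].values())
    if total == 0:
        return 0.0  # blank page: nothing to measure
    upper = sum(n for ch, n in page["begin_chars"].items() if ch.isupper())
    return upper / total


for number, page in enumerate(pages, start=1):
    print(number, round(capitalized_line_share(page), 2))
```

A page with a high share would only be a candidate for verse; headings, lists, and indexes can produce similar patterns, so a result like this is best treated as a prompt for closer reading.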

Researchers often use this data to investigate hypotheses about particular texts or the relationships between texts. For example, one could test whether a certain word occurs more often in one of an author’s books than in the rest of their novels.
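A comparison of this kind could be sketched as follows. Because volumes differ in length, the example normalizes raw counts to occurrences per million tokens before comparing. The titles and numbers are invented for illustration, and `per_million` is a hypothetical helper, not a function from any HTRC tool.

```python
def per_million(term_count, total_tokens):
    """Occurrences of a term per million tokens."""
    if total_tokens == 0:
        return 0.0
    return term_count * 1_000_000 / total_tokens


# Illustrative data only: (title, count of the term, total tokens in the volume).
volumes = [
    ("Novel A", 120, 100_000),
    ("Novel B", 30, 150_000),
    ("Novel C", 45, 150_000),
]

target_title = "Novel A"
target = [v for v in volumes if v[0] == target_title]
others = [v for v in volumes if v[0] != target_title]

# Pool counts and tokens within each group before normalizing.
target_rate = per_million(sum(v[1] for v in target), sum(v[2] for v in target))
others_rate = per_million(sum(v[1] for v in others), sum(v[2] for v in others))

print(target_rate)                 # 1200.0
print(others_rate)                 # 250.0
print(target_rate / others_rate)   # 4.8
```

In this made-up case the term is 4.8 times as frequent in the target novel as in the author's other novels. A real study would also need to ask whether a difference of that size could arise by chance.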

Here are some examples of how scholars and researchers are using the dataset to create new projects.


The Word Similarity Tool is a web-based tool for viewing the words most similar to a query word, for each year from 1800 to 1923. A search for a word reveals the words that occurred in similar contexts (appeared next to each other in a text) and how these occurrences changed over time.

https://mimno.infosci.cornell.edu/wordsim/nearest.html


The HT+ Bookworm is a visualization tool that displays the frequency of terms, in occurrences per million words, over time.

https://bookworm.htrc.illinois.edu/
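
The per-million measure over time can be approximated from volume-level counts, as in the sketch below. It is not the Bookworm's own implementation; the data is invented, and the function name is hypothetical. Counts are pooled by publication year before normalizing, so longer volumes carry proportionally more weight.

```python
from collections import defaultdict

# Illustrative data only: (publication year, count of the term, total tokens).
volume_stats = [
    (1850, 40, 200_000),
    (1850, 10, 300_000),
    (1900, 90, 250_000),
    (1900, 60, 250_000),
]


def frequency_per_million_by_year(stats):
    """Pool counts by year, then express the term's rate per million tokens."""
    term_totals = defaultdict(int)
    token_totals = defaultdict(int)
    for year, term_count, total_tokens in stats:
        term_totals[year] += term_count
        token_totals[year] += total_tokens
    return {
        year: term_totals[year] * 1_000_000 / token_totals[year]
        for year in sorted(token_totals)
        if token_totals[year] > 0  # skip years with no tokens at all
    }


print(frequency_per_million_by_year(volume_stats))  # {1850: 100.0, 1900: 300.0}
```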