From electric cars to online banking, our technology relies on vast amounts of data that can be gathered from the world around us.
The ubiquity of data—our location, the temperature outside, and even our shopping habits—is what makes it useful, but it also creates a gateway to threats that need to be handled carefully.
Hertz Fellow Kathleen Fisher wants to make sure that all data, no matter where it comes from, is safe for the software that relies on it.
“The security of software systems is crucially important to the well-being and function of the world,” she says. “For the last 30 years, we've taken software for granted; that it works well enough.”
Fisher has spent her career making software easier to write and safer to use. She spent 15 years at AT&T Labs making it easier to develop high-quality software quickly. When she was a program manager at the Defense Advanced Research Projects Agency (DARPA), her team was responsible for building tools that make it easier to write software verifiably free of certain kinds of flaws, better known as "bugs," which give hackers a weak point to attack.
Now a computer scientist and department chair at Tufts University, Fisher is building software that can recognize harmful or messy data before it gets integrated into our systems in the first place.
Making Sense of Messy Data
Dealing with messy data is akin to cleaning out a junk drawer. You sift through the bits and bobs, sorting the important pieces from the hazardous items. As you figure out what's in the drawer, what's missing, and whether any items should be ignored, you generate a list of seemingly random, ad hoc data.
Only instead of using a human to recognize and categorize all that junk, Fisher wants to use software.
To do this, Fisher is focused on a project called PADS (short for "Processing Ad Hoc Data Sources") that was started years ago while she was at AT&T Labs. Portions of it are currently funded by DARPA. The premise of PADS is to take any data—messy, strangely organized, unlabeled—and get it into a format that is convenient for downstream analysis.
“PADS is an example of a data description language,” says Fisher. “Such languages are used to describe data formats that appear in the wild.”
One example of data that must be sorted to make sense is CSV (comma-separated values) data, commonly used in spreadsheets. Other examples include binary packet formats, system log files, and scientific data sets in fields ranging from biology to physics.
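To make the idea of a data description concrete, here is a minimal Python sketch. It is not PADS syntax and does not use the PADS toolchain; the field names, the `DESCRIPTION` table, and the `parse_line` helper are all hypothetical, chosen only to show how a declarative description of a comma-separated record might drive a parser that flags malformed input rather than silently accepting it.

```python
# Hypothetical illustration only: this is NOT PADS syntax or its API.
# A "description" declares each field's name and how to convert it.
DESCRIPTION = [
    ("caller", str),
    ("minutes", int),
    ("charge", float),
]


def parse_line(line, description=DESCRIPTION):
    """Parse one comma-separated record against the description.

    Returns (record, errors): record maps field names to converted
    values (None where conversion failed); errors lists the problems.
    """
    parts = line.strip().split(",")
    record, errors = {}, []
    if len(parts) != len(description):
        errors.append(f"expected {len(description)} fields, got {len(parts)}")
    for (name, convert), raw in zip(description, parts):
        try:
            record[name] = convert(raw.strip())
        except ValueError:
            record[name] = None
            errors.append(f"bad value for {name}: {raw!r}")
    return record, errors


good = parse_line("555-0100,12,3.50")      # well-formed record
bad = parse_line("555-0101,twelve,3.50")   # "twelve" is not an integer
```

The point of the sketch is the separation of concerns: the description says what the data should look like, and the same generic code both reads good records and reports exactly where a bad one deviates.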
Any company or organization that regularly ingests a lot of data and has to sift through it quickly for valuable information would benefit from PADS. For example, AT&T collects huge amounts of data—billing data, user data and downloads, international calling—and PADS could help make sense of the firehose of data collected daily.
Making Data More Accessible
Fisher wants her work on PADS to speed up the ingestion of data into analysis pipelines. The PADS compiler can produce tools that not only parse data into forms that computers can manipulate but also translate it into more standard formats or perform statistical analyses, making the data more accessible and useful.
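As a rough illustration of what such tools do, the Python sketch below parses a few records, translates them into JSON, and computes a simple statistic. It is hand-written code using only Python's standard library, not output of the PADS compiler; the sample data and column names are invented for the example.

```python
import csv
import io
import json
import statistics

# Invented sample data; the columns are hypothetical.
RAW = "caller,minutes\n555-0100,12\n555-0101,30\n555-0102,n/a\n"

rows, rejected = [], []
for row in csv.DictReader(io.StringIO(RAW)):
    try:
        row["minutes"] = int(row["minutes"])
        rows.append(row)
    except ValueError:
        # Set unparseable records aside instead of dropping them silently.
        rejected.append(row)

# Translate the clean records into a standard format.
as_json = json.dumps(rows)

# Perform a simple statistical analysis on the clean records.
mean_minutes = statistics.mean(r["minutes"] for r in rows)
```

Keeping the rejected records in their own list mirrors the article's theme: messy input is recognized and separated before it reaches downstream analysis.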
Fisher sees PADS as a tool that enables more impactful work in other fields. She says it can help answer questions from large amounts of data in ways that couldn’t be done otherwise. For instance, financial institutions could use it as part of an analysis pipeline to monitor and verify transactions, healthcare professionals could use it to evaluate historical data on treatment options, and researchers could incorporate historical data to expand their sampling records.
PADS would also benefit universities and other distributed organizations that deal with enormous and varied amounts of data flying through their networks. A university IT department, for example, may observe packets of information that could be anything from a graduate research project to a malicious attack. PADS would help the staff understand the purpose of that information so they could take action if necessary.
“You can use PADS to recognize things that are sort of mysterious without having to really have someone sit down and go through them,” Fisher says.