Cross-post from http://idibon.com/privacy-with-language-technologies/
One downside of the kind of technologies that we build at Idibon is that they can be used to compromise people’s privacy and, by extension, their safety. Any technology can be used for positive and negative purposes and as engineers we have a responsibility to ensure that what we create is for a better world.
For language technologies, the most negative application, by far, is eavesdropping: discovering information about people by monitoring their online communications and using that information in ways that harm those individuals. This can be something as direct and targeted as exposing the identities of at-risk individuals in a war zone or it can be the broad expansion of government surveillance. The engineers at many technology companies announced their opposition to the latter with a loud, unified call today to reform government surveillance.
One way that privacy can be compromised at scale is the use of technology known as “named entity recognition”, which identifies the names of people, places, organizations, and other types of real-world entities in text. Given millions of sentences of text, named entity recognition can extract the names and addresses of everybody in the data in just a few seconds. But the same technology that can be used to uncover personally identifying information (PII) can also be used to remove that information from the text. This is known as anonymizing or simply “scrubbing”.
By scrubbing data of personally identifying information we can share that data more broadly while reducing the risk to the people mentioned. An example might look like this:
- I’m ********** at the corner of ********** and ********** where we have a dozen children who require shelter and medicine.
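To make the idea of scrubbing concrete, here is a minimal sketch in Python. It is not Idibon's system: a real scrubber relies on a trained named entity recognition model, while this sketch assumes that email addresses and US-style phone numbers can be caught with simple patterns and that names and places come from a caller-supplied list. The names `scrub`, `EMAIL`, `PHONE`, and `MASK`, and the sample message, are all hypothetical.

```python
import re

# Assumption: simple patterns stand in for a trained named entity
# recognition model. They only catch rigidly formatted identifiers.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b")

MASK = "**********"


def scrub(text, known_names=()):
    """Replace emails, US-style phone numbers, and listed names with a mask.

    known_names is a hypothetical stand-in for the entities a recognizer
    would find; each entry is assumed to start and end with a letter or digit.
    """
    text = EMAIL.sub(MASK, text)
    text = PHONE.sub(MASK, text)
    # Mask longer names first so a short name cannot split a longer one.
    for name in sorted(known_names, key=len, reverse=True):
        pattern = r"\b" + re.escape(name) + r"\b"
        text = re.sub(pattern, MASK, text, flags=re.IGNORECASE)
    return text


if __name__ == "__main__":
    message = (
        "I'm Jane Doe at the corner of Main Street and 5th Avenue, "
        "email jane@example.org or call 212-555-0142."
    )
    print(scrub(message, ["Jane Doe", "Main Street", "5th Avenue"]))
```

The hard part in practice is the step this sketch skips: finding names and addresses that nobody listed in advance, which is what named entity recognition is for.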
After Hurricane Sandy, Recovers’ software was used in New York City to post important information and collect donations, volunteers, and specific individual/community needs. Recovers collected information from about 20,000 volunteers and donors, and helped organizers meet hundreds of requests. While there is a benefit in sharing this data to understand how information is exchanged in disaster response contexts, the data also contained the personal details of many at-risk individuals and other information such as people’s email and personal addresses connected to their names.
By using Idibon’s advanced language technologies to identify and scrub personal information, we were able to efficiently and comprehensively remove the personally identifying information from the 20,000 pieces of communication in the Recovers data about Hurricane Sandy. We were able to do this within our systems for just a few hundred dollars: privacy does not need to be expensive. This will enable a larger number of people to learn from Hurricane Sandy by examining the types of needs, resources and communication styles that are present, without further risk to those in New York.
– Rob Munro
Dec 9, 2013