'Clean Data' offers clear guidance for data scientists
It takes patience to correct misspelled words, convert file formats and remove incomplete records when analyzing large amounts of information contained in databases and spreadsheets. A new book by Elon University Professor Megan Squire helps to make such tasks easier.
A database or spreadsheet is only as good as the records it contains. If certain information is missing, or words are misspelled, or unusual characters have appeared because two different software programs don’t read a file the same way, then you might have a problem.
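To make those three problems concrete, here is a minimal Python sketch. It is not taken from the book. The record fields, the list of known state names and the helper functions are all hypothetical, and it assumes the stray characters came from UTF-8 text being read as Windows-1252.

```python
import difflib

# Hypothetical list of correctly spelled values to compare against.
KNOWN_STATES = ["North Carolina", "Florida"]


def fix_mojibake(text):
    """Undo UTF-8 text that was mistakenly decoded as Windows-1252."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # not this kind of damage, so leave it unchanged


def fix_spelling(value, known=KNOWN_STATES, cutoff=0.8):
    """Replace a value with its closest known spelling, if one is close enough."""
    matches = difflib.get_close_matches(value, known, n=1, cutoff=cutoff)
    return matches[0] if matches else value


def clean(records, required=("name", "state")):
    """Drop records with missing required fields, then repair the rest."""
    cleaned = []
    for rec in records:
        if any(not (rec.get(field) or "").strip() for field in required):
            continue  # incomplete record
        cleaned.append({
            "name": fix_mojibake(rec["name"].strip()),
            "state": fix_spelling(rec["state"].strip()),
        })
    return cleaned


records = [
    # "\u00e2\u20ac\u2122" is how a curly apostrophe looks after the wrong decoding.
    {"name": "O\u00e2\u20ac\u2122Brien", "state": "Nroth Carolina"},
    {"name": "Lee", "state": ""},           # missing state
    {"name": "Garcia", "state": "Florida"},
]
cleaned = clean(records)
```

In this sketch the first record has its apostrophe and its state name repaired, the second is dropped as incomplete and the third passes through unchanged. Real data sets usually need a more careful decision about whether to drop, flag or fill in incomplete records.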
What can be done to “clean” those types of records? How can people who work with large data sets avoid problems altogether? More importantly, what are the ethics of sharing your data?
Elon University Professor Megan Squire answers those questions and more in “Clean Data,” published this spring by Packt Publishing. Squire described her first authored book as a “collection of techniques” written for data scientists of all abilities.
The book highlights the importance of data cleaning in data science and reinforces basic concepts like file formats, data types and character encodings. It teaches data scientists how to extract and clean data stored in various file formats, including PDFs, and it provides practical examples for readers.
“To me, this is a valuable way to teach people how to get stuff done, and we can enable other researchers,” said Squire, who was inspired to write “Clean Data” after reading a statistic in the New York Times about data scientists. “You may still spend 80 percent of your time doing prep work, but that prep work will be easier and more efficient.”
Squire’s book contains chapters on file formats, file compression, Unicode and mining Twitter for data. It also includes information on making data public for others to use. Squire said most data sets, especially those used to support research findings, shouldn’t be kept secret.
“That is a path to bad science,” she said. “A lot of times it starts innocently. People are just shy about their work!”
The book couldn’t come at a better time for Squire’s potential readers. The Harvey Nash 2015 CIO Survey, conducted in association with KPMG, found that the demand for big data analytic skills has “leapt to the number one most-needed skill, skyrocketing to almost six times higher than the next-most-scarce skill, change management.”
“In the 17 years we have conducted the survey … we have never seen demand for a skill expand so quickly as we have for big data analytics,” Albert Ellis, CEO of Harvey Nash Group, said in a May news release announcing results of the London-based international consulting firm’s most recent survey.
Squire joined the Elon faculty in 2003 and has also worked at several technology startups in North Carolina and Florida. She teaches courses in data mining, web development, database development, introductory software development and data science, among other classes.
Her research focuses on free, libre and open source software (FLOSS), specifically the collection, curation and federation of large amounts of data about how FLOSS engineering projects are developed. She currently collects, aggregates, stores, cleans and mines that data.
Squire co-founded and leads a project called FLOSSmole, a team of software developers who write programs to collect and analyze this FLOSS data, and then freely provide the results back to the FLOSS research community.