Big data has transformed human lives through its innumerable applications and widespread coverage. With the rise of technology-enabled platforms like social media that connect people across boundaries, irrespective of their age, caste, colour and creed, big data has found many applications in fields related to social science. Most conventional research in social science deals with the collection of data and its analysis at the community and population level. However, next-generation social scientists depend on technology-driven systems for data collection and empirical analysis of socio-economic research problems. That said, social scientists are yet to explore and understand the true potential of big data technologies and data science in making their lives and work easier.
Recent technological breakthroughs have given rise to many new types of data. The predominant forms include Internet data, mobility data, administrative data and geospatial data. Internet data is accumulated from multiple sources such as search engines and social media platforms, or social machines, like Twitter and Instagram. Mobility data is gathered from GPS-based systems like Google Maps, while data collected in the form of satellite images and weather parameters is referred to as geospatial data. Lastly, government organizations and institutions are the primary contributors to, and monitoring agencies of, open data for sectors such as education, public sector banking and business, among many others.
Social science research follows some standard procedures, chief among them data collection, storage, management, analysis and the production of results. Conventionally, researchers identify a problem, form a hypothesis and collect data, in the form of interviews and surveys, to test that hypothesis. Typically, if hand-written forms are collected, the data needs to be entered into systems such as MS Excel sheets or databases. The built-in functions of MS Excel or of a database management system are then used to perform statistical analysis and obtain the results. Storage, processing and visualization are therefore all performed using software such as MS Excel or the GUI of a database management system.
This is an effective operating procedure for data that is manageable at the system level or available in a structured format. However, if the data is collected from Internet sources or social media, following this procedure may not be possible, as conventional systems are incapable of handling data of such high volume, variety and velocity. The data pipeline for social science research can be considered a scaled-down version of the big data pipeline, which entails the following steps: data acquisition, storage, processing and visualization. In this big data era, therefore, the world is no longer the same for social scientists and researchers. However, before a technological analogy for social research can be established, it is important to understand that research in social science poses some specific challenges to technological development, in addition to adoption and usage issues in developing countries like India, where technical expertise and language literacy are the primary limiting factors.
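The four pipeline stages named above (acquisition, storage, processing and visualization) can be illustrated at toy scale. The following is only a minimal sketch under stated assumptions: the sample records, the function names and the table layout are all hypothetical, and an in-memory SQLite table stands in for whatever storage layer a real study would use.

```python
import sqlite3
from collections import Counter

# Hypothetical placeholder records standing in for posts fetched from a
# social platform; they are illustrative only, not real data.
SAMPLE_POSTS = [
    {"author": "u1", "topic": "education", "text": "New school opened"},
    {"author": "u2", "topic": "banking", "text": "Branch hours changed"},
    {"author": "u3", "topic": "education", "text": "Exam dates announced"},
]


def acquire():
    """Acquisition: return raw records (a real system would query a source)."""
    return list(SAMPLE_POSTS)


def store(records):
    """Storage: persist the records in an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE posts (author TEXT, topic TEXT, text TEXT)")
    conn.executemany(
        "INSERT INTO posts VALUES (:author, :topic, :text)", records
    )
    conn.commit()
    return conn


def process(conn):
    """Processing: count how many stored posts fall under each topic."""
    rows = conn.execute("SELECT topic FROM posts").fetchall()
    return Counter(topic for (topic,) in rows)


def visualize(counts):
    """Visualization: build a plain-text bar chart, one line per topic."""
    return [f"{topic:<10} {'#' * n}" for topic, n in sorted(counts.items())]


def run_pipeline():
    """Chain the four stages and return the chart lines."""
    conn = store(acquire())
    try:
        return visualize(process(conn))
    finally:
        conn.close()


if __name__ == "__main__":
    for line in run_pipeline():
        print(line)
```

At big data scale each stage would be replaced by a distributed counterpart, but the order of the stages stays the same.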
Traditional techniques for data analysis need to be revisited for all stages of data management, from acquisition to interpretation and visualization. Technologically, one of the biggest challenges is processing multilingual datasets. Moreover, data is usually available in a fragmented format across diverse sources. Accumulating it might therefore require translating data from one language to another, or from one scientific discipline to another. Besides this, issues such as missing and unreliable data are major concerns for the efficacy of such a data-driven approach to decision-making. Solutions must therefore include approaches that can fix potential data issues and assess the degree to which such inaccuracies can affect the outcome of the process. Finally, social media generates data at a remarkably high velocity, so all the processes in the data pipeline must be equipped to work in real time.
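One simple way to assess how far missing data could affect an outcome is to bound the result by its worst cases. The sketch below is an illustration under stated assumptions, not a method prescribed here: it assumes responses lie on a known numeric scale, marks missing answers as `None`, and uses a hypothetical helper named `mean_with_bounds`. It reports the mean of the observed answers together with the lowest and highest means that would be possible if every missing answer took the scale minimum or maximum.

```python
def mean_with_bounds(values, low, high):
    """Return the observed mean plus worst-case bounds for missing entries.

    values: list of numbers, with None marking a missing response.
    low, high: the smallest and largest value a response can take.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to analyse")

    n = len(values)
    missing = n - len(observed)
    total = sum(observed)

    return {
        # Mean over the answers that were actually recorded.
        "observed_mean": total / len(observed),
        # Mean if every missing answer had been the scale minimum.
        "lower_bound": (total + missing * low) / n,
        # Mean if every missing answer had been the scale maximum.
        "upper_bound": (total + missing * high) / n,
        # Fraction of the sample that is missing.
        "missing_share": missing / n,
    }


if __name__ == "__main__":
    # Hypothetical survey answers on a 1-5 scale, two of them missing.
    answers = [4, None, 5, 3, None]
    print(mean_with_bounds(answers, low=1, high=5))
```

A wide gap between the two bounds signals that conclusions drawn from the observed mean alone are fragile; a narrow gap suggests the missing entries cannot change the result much.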
The potential of big data technologies to make results statistically relevant will facilitate research solutions that are generic and effective in solving the identified problems. However, in view of the challenges that this research domain poses to technologists and the big data community, we have a long way to go before domain-specific solutions for social science can come into existence.