A Blog recording the life a Web Scientist
Web Science 2014 Conference – Summary and Thoughts
The following post provides a summary of the activities and research that was presented at the Web Science 2014 Conference, held in Blooming, Indiana.
Web Observatory Workshop
The Web Observatory workshop (Link), which was led by members of the Southampton SOCIAM team built upon the past three workshops and various research activities related to the ongoing efforts of building Web Observatories. The emphasis of this workshop was for current research that described the implementation and use of different types of Web Observatories, how they are built, and what they are being used for. The workshop contained 5 presentations which was followed by a round-table discussion focusing on the best practices and next steps required to interoperate between Observatories.
Professor Noshir Contractor gave an excellent opening keynote, with a strong emphasis on methods, and the benefits of making data available Web Observatories. Along with the advantages described, Noshir made the excellent point that we need to be aware of analytical dangers, and we need to make sure we are not looking only where the data is, we need to expand beyond the traditional watering holes (e.g Twitter). Complementing Noshir’s argument, Siripen Pongpaichet’s (Link) presentation of EventShop demonstrated how we need to think about data beyond the Web. EventShop provides a platform to draw together Web and environmental data in order to provide richer insight into real-world conditions. However, this raises the question of the boundaries of Web Observatories, and what types of data should we be dealing with? Should we be trying to cater for this type of data as well? Whilst there was a strong agreement that this data is important, there was consensus that the initial focus should be on a number of core datasets (whatever they may be), and then place an emphasis on trying to bring together other datasets. Following this, Thanassis Tiropanis and I gave a presentation on the Web Observatory work under way at the Web and Internet Science group (WAIS), and within SOCIAM, describing the current efforts in developing a three tier architecture which separates the data stores, data cataloguing, and analytics/visualisation components (Link). Over the last 9 months there has been much effort invested into developing the necessary infrastructure to support storing, accessing, and querying large-scale Web streams, which an increasing emphasis on offering streaming access and analytics across multiple streams of Web data.
Another strong theme that ran though the presentations and round-table discussion was the strong push towards using a common schema to describe the datasets that are going to be listed on the Web Observatories around the world. The only way that this is going to grow beyond the single silos of institutions, organisations, and businesses is that when we list our datasets, we use the common schema. There has already been a lot of work on this, and we have developed a common schema that is now available on schema.org (Link).
The discussions that were had in this workshop were echoed throughout the main conference, where there was a lot of discussion around the use and availability of data, as well as what type of methods can be applied to these various forms of data?
Web Science Main Conference
Following a full day of workshops, the main conference consisted of a 3-day, single track programme, comprising of morning keynotes, followed by a series of presentations of varying formats including pecha kuchas and standard presentations. I was fortunate enough to have two submissions accepted and have the opportunity to give a pecha kucha and poster to talk about the ongoing Citizen Science research, and a 15 minute presentation, in which I presented the current research on sociological methods for Big Data, with my colleagues, Professor Susan Halford, and Olivier Philippe. The poster presented the latest findings of the analysis of citizen scientists and their activities and engagement as a community. Following a stream of studies, the findings of this analysis discovered the clustering of participant discussion forum activity on the Galaxy Zoo forums, which may correlate to specific motivation of participation.
The second paper presented discussed the methodological challenges and potential solutions to engaging with Big Data sources whilst appreciating the complexities and pitfalls of quantitative big data analysis. The main argument and contribution of this work is showing that mixed methods, even when dealing with extremely large sources of data is essential for understanding the nature of the data, and to avoid the ‘Twitterology’ of misrepresenting the data or the context. The presentation resulted in a good level of discussion around the role of interdisciplinary thinking and methods, with a strong agreement that we (the Web Science community) continue to engage and develop methods that bridge across disciplines in order to better understand what we are observing. In fact, Daniel Tunkelang’s opening Keynote was a great start to the conference, as it was truly inline and with the topics and themes that were ripe for discussion throughout the conference. Daniel discussed issues of big data, methods, correlation and causation, human interaction, and understanding human behaviour beyond the aggregated counts offered by statistics.
One of the core themes that persisted throughout the conference presentations was the discussion and analysis of different Web systems, their behaviour and characteristics. Many of the presentations focused on specific ‘features’ of Web systems such as Wikipedia, Facebook, Tumblr, Twitter. As a nice summary, Harith Alani’s presentation on “Mining and Comparing Engagement Dynamics across Multiple Social Media Platforms” (Link) mapped the studies related to the various systems that the Web Science community have been analysing, illustrating how the number of systems analysed has grown since the initial Web Science conference. Whilst this is not surprising given the growth of these systems, it is interesting to see the focus around specific social machines such as Twitter. This raises a number of questions regarding our focus; are we still placing a lot of resources analysing these systems because they are easy to access, is it because they are the most successful (and what is the definition of success), or is it because we are still trying to develop methods to better understand the systems and the data? These questions are extremely relevant to SOCIAM given that we have focused on similar topics in SOCIAM for quite some time. The research presented is extremely useful and to the work we have been doing in trying to classify the behavioural (and observational) characteristics of social machines. This type of classification will not only help provide generalizable features of different types of social machines, but also provide us with a framework to help observe them. However, whilst we are focusing on the successful and high profile social machines, we also need to be aware of those that are hidden, or failing. What are the characteristics of those social machines that do not gain popularity, large user bases, or obtain substantial interaction with others in the Web eco-system? Something yet to be understood.
On reflection, Web Science 2014 was a great conference with a vibrant and growing community. This year there was a surprising amount of buzz around the need to overcome Twitterology and the Big Data hype; interdisciplinary methods was the message of the day. In SOCIAM, we are always keen to present the 4-quadrant diagram of people vs machines (see below, courtesy of Professor Dave De Roure), and we consider social machines to fall into the top right quadrant, more people, more compute. We are now at a point where we need to consider this in terms of methods. We are well aware that big data =/= large scale (quant) methods, nor does small data =/= small scale (qual) methods. Whilst this requires a much lengthier discussion regarding the epistemological and ontological boundaries and limitations of both the quantitative and qualitative paradigm, perhaps we are working towards something as illustrated in Figure 2?