1
Designing a Human-Machine Hybrid Computing System for Unstructured Data Analytics
KOUSHIK SINHA, GEETHA MANJUNATH, BIDYUT GUPTA, SHAHRAM RAHIMI
2
Outline
Introduction
Problem Statement
Proposed Platform
Task Execution Management Engine
Results
Conclusion
3
Introduction
Human Computation
“Some problems are hard, even for the most sophisticated AI algorithms.”
“Let humans solve them...”
4
Introduction
Using Humans as Computers
A very old idea: humans were the first “computers”
Halley’s comet orbit, 1758
Astronomical almanac with moon positions, used for navigation, 1760
Logarithmic and trigonometric tables, 1794
Math Tables Project, unskilled labor, 1938
Grier, When Computers Were Human, 2005; Grier, IEEE Annals 1998
5
Introduction
Crowd: a group of workers willing to do short, simple tasks on a crowdsourcing platform
A heterogeneous group; members do not know each other and work independently
An individual member of such a crowd is known as a crowd worker, or simply a worker
Microtask: a smaller, well-defined sub-task derived from task decomposition
Can be done quickly by humans - a few seconds/minutes of low cognitive load
Machine solution unsatisfactory: either not solvable by a machine algorithm, or of poor quality, or would take significantly longer than humans
Can be solved independently of other microtasks derived from the same task
Microtask examples: image tagging, image categorization, image digitization, text validation in images, object tagging in images, sentiment analysis of text, text classification, language translation, event detection in video, keyword spotting in audio
6
Microtask examples by data type
Speech: speech transcription, speech translation, keyword spotting, sentiment analysis
Document images: tagging, categorization, digitization/OCR, validating OCR
Video: data collection, description/tagging, event detection, object recognition
Text: creation/data collection, language translation, sentiment analysis, categorization
Images: data collection, description/tagging, object location, object recognition, demographics
Example sentiment-analysis inputs: “I would rather eat a stone instead of this cake”, “XYZ printers are not that bad”
[Figure: a Human Intelligence Task (HIT) / microtask flow - a broker posts the task “Is this a dog? Yes/No” at a pay of $0.01, workers answer “Yes”, and the broker pays $0.01]
7
ReCAPTCHA – Using Human Computation to Improve OCR
200M+ CAPTCHAs solved each day by people around the world
Used on 350,000+ sites, digitizing 100M words per day - the equivalent of 2.5M books a year
8
Crowdsourcing Microtasks - Many Platforms, Multiple Dimensions
Experience: ease of use, quality of crowd, satisfactory results (SLO), cost advantage, privacy & security
Infrastructure: source of crowd, work definition support, work & process oversight, results & quality management, payment processing
9
Problem Statement
Design a human-machine hybrid computing system for unstructured data analytics
Orchestrate machine and human computing resources to meet:
Service level objectives (SLO) - budget, turnaround time, accuracy
Scalability - handle big data volumes
Reliability - resilience against the unpredictability of computing resources
Why this is hard:
Providing SLO guarantees under dynamic resource availability and quality conditions
Algorithms do not meet quality expectations for unstructured data analytics
Humans are unpredictable, slow, and error-prone, and the required skills may not be available immediately
10
Problem Statement – Single Task
No microtask-based crowdsourcing platform provides automated, runtime management of all three SLOs of accuracy, time, and budget
Goal: complete task S with the quality/accuracy of the results being at least A*, while ensuring that the total money spent is less than budget B* and the total time taken is less than T*, using human + machine computing agents
We assume data-parallel tasks - similar but independent microtasks with different inputs
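Stated compactly, the single-task goal above amounts to three constraints, where A(S), B(S), and T(S) denote the achieved accuracy, total spend, and total time for task S (this just formalizes the slide text):

$$
A(S) \ge A^{*}, \qquad B(S) \le B^{*}, \qquad T(S) \le T^{*}
$$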
11
Scheduling Task Graph Problem
Task graph nodes represent subtasks that are solvable using a human-machine agent system
Black nodes: only machine agents; white nodes: only human agents; gray nodes: either machine or human agents
SLO metrics (A*, B*, T*) are given for the entire workflow
Need to derive initial SLO-goal estimates for the individual nodes
Due to uncertainty in the achieved (A, B, T) for a node, dynamic redistribution of SLO goals is required
May require multiple iterations of the same task/node
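One plausible way to perform the dynamic redistribution mentioned above is to re-split the remaining workflow budget and time among the not-yet-started nodes in proportion to per-node weights. The sketch below illustrates this idea; the function name and the proportional-split rule are assumptions, not the paper's algorithm.

```python
from typing import Dict, List

def redistribute_slo(remaining_budget: float, remaining_time: float,
                     pending_nodes: List[str],
                     weights: Dict[str, float]) -> Dict[str, Dict[str, float]]:
    """Split the remaining workflow budget and time among the pending nodes
    in proportion to per-node weights (e.g. estimated work per node)."""
    total = sum(weights[n] for n in pending_nodes)
    goals = {}
    for n in pending_nodes:
        share = weights[n] / total
        goals[n] = {"budget": remaining_budget * share,
                    "time": remaining_time * share}
    return goals

# Example: one node overshot its goals, so the leftover budget and time are
# re-split between the two nodes that have not started yet.
print(redistribute_slo(remaining_budget=30.0, remaining_time=20.0,
                       pending_nodes=["validate", "categorize"],
                       weights={"validate": 2.0, "categorize": 1.0}))
```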
12
Proposed Platform
The platform will provide 3 main interface libraries for users:
Crowd Access Layer - brings in crowd workers from crowdsourcing platforms (AMT, CrowdFlower, etc.) as well as social networking platforms (Twitter, Facebook, Google+, FourSquare, etc.)
Machine Abstraction Layer - provides the APIs needed for users to plug in/register their own application-specific algorithms/software, and gives access to a standard toolbox of algorithms for text, image, and video analytics using a software-as-a-service (SaaS) model
Task Management Library - exposes a task-workflow specification interface: users can represent/decompose a complex task as a task dependency graph, annotate tasks/nodes with SLO goals, and submit the workflow to the Task Execution Management Engine for execution
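A rough sketch of how the task-workflow specification interface might be used; the WorkflowNode and Workflow names, fields, and submit() call are illustrative assumptions, not the platform's actual API.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class WorkflowNode:
    """One node of the task dependency graph, annotated with SLO goals."""
    name: str
    agent_type: str                       # "machine", "human", or "either"
    slo: Dict[str, float]                 # e.g. {"accuracy": 0.9, "budget": 20.0, "time_h": 24.0}
    depends_on: List[str] = field(default_factory=list)

@dataclass
class Workflow:
    """Hypothetical workflow container; submit() stands in for handing the
    annotated graph to the Task Execution Management Engine."""
    nodes: List[WorkflowNode] = field(default_factory=list)
    slo: Dict[str, float] = field(default_factory=dict)   # workflow-level (A*, B*, T*)

    def add(self, node: WorkflowNode) -> "Workflow":
        self.nodes.append(node)
        return self

    def submit(self) -> None:
        for node in self.nodes:
            print(f"submitting {node.name} [{node.agent_type}] with SLO {node.slo}")

wf = Workflow(slo={"accuracy": 0.85, "budget": 50.0, "time_h": 48.0})
wf.add(WorkflowNode("tag_images", "either", {"accuracy": 0.9, "budget": 20.0, "time_h": 24.0}))
wf.add(WorkflowNode("classify_text", "machine", {"accuracy": 0.8, "budget": 5.0, "time_h": 2.0},
                    depends_on=["tag_images"]))
wf.submit()
```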
13
Task Execution Management Engine
System Model
[Figure: n microtasks are split between human agents (H), which receive n_h microtasks, and machine agents (M), which receive n_m microtasks]
14
Task Execution Management Engine Initial Constraints
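A minimal sketch of one plausible form of the initial constraints, assuming per-microtask costs c_h and c_m, completion-time functions T_h and T_m, and expected per-microtask accuracies a_h and a_m for the human and machine agent pools; all of these symbols are assumptions, not taken from the slide:

$$
n_h + n_m = n, \qquad
n_h\,c_h + n_m\,c_m \le B^{*}, \qquad
\max\bigl(T_h(n_h),\, T_m(n_m)\bigr) \le T^{*}, \qquad
\frac{n_h\,a_h + n_m\,a_m}{n} \ge A^{*}
$$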
15
Dynamic Microtask Execution Control
17
DMTEC Corrective Actions
Mobile/captive workers are a special class of workers:
Microtask assignments can be “pushed” to them
They work on assigned microtasks with the least average delay among all worker types
They produce result quality comparable to expert workers
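A rough sketch of this corrective action, assuming the engine tracks a projected completion time and can reassign pending microtasks to the mobile/captive worker pool; the function and field names are illustrative, not the system's actual interface.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Microtask:
    task_id: str
    assigned_to: str = "public_crowd"   # default worker pool

def push_to_captive_workers(pending: List[Microtask],
                            projected_time_h: float,
                            deadline_h: float) -> List[Microtask]:
    """If the projected completion time exceeds the time SLO, reassign the
    still-pending microtasks to mobile/captive workers, who accept pushed
    assignments and have the lowest average delay."""
    if projected_time_h <= deadline_h:
        return pending
    for mt in pending:
        mt.assigned_to = "mobile_captive"
    return pending

# Example: two unfinished microtasks, 30h projected vs. a 24h deadline.
tasks = [Microtask("tweet-112"), Microtask("tweet-187")]
for mt in push_to_captive_workers(tasks, projected_time_h=30.0, deadline_h=24.0):
    print(mt.task_id, "->", mt.assigned_to)
```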
18
Experiment Results
Used a set of 250 tweets
Each tweet categorized into one of 6 intent categories: promotion, appreciation, information sharing, criticism, enquiry, complaint
Experimented with 3 types of crowd:
Known, expert workers
Workers from AMT with no special qualification or training
Experienced workers from AMT who had taken our short training for qualification
19
Experiment Results
Performance of different agent types - results for the 250-tweet dataset

Agent Type                 | % Accuracy | Time (days) | Cost ($)
Public (3 votes)           | 57.2       | 1           | 15
Public (5 votes)           | 78         | 1           | 25
Expert Workers (3 votes)   | 91.8       | 1           | 37.5
Public Qualified (3 votes) | 80.4       | 7           | 15
HP Autonomy IDOL           | 67.2       | < 1         | 1.25
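The “(3 votes)” and “(5 votes)” agent types imply that several worker answers per tweet are combined into one label; a minimal majority-vote sketch of that aggregation follows, though the exact rule used in the experiments is an assumption.

```python
from collections import Counter
from typing import List

def majority_vote(votes: List[str]) -> str:
    """Return the intent category chosen by the most workers for one tweet."""
    return Counter(votes).most_common(1)[0][0]

# Example: 3 worker votes for one tweet.
print(majority_vote(["Complaint", "Criticism", "Complaint"]))  # -> "Complaint"
```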
20
Experimental Results
21
Simulation Results
Used a subset of the corrective actions
22
Simulation Results
Best results: 85.2% accuracy in 1.5 days for $19.59
23
Simulation Results
The time at which corrective action 2 is taken is critical
Similar results were seen for the other corrective actions
24
Conclusion
25
Thank You