Modeling Performance
In addition to solving the model quickly, getting the answer also requires building the model quickly. In this section we present some methods used to build models using docplex in Python. The constraint we are building is for a classroom scheduling model, in which no two potential sections may overlap at any time in a specific room. For each time period t, let S(t) be the candidate sections at time t, O(t) the candidate sections at times overlapping t (excluding t itself), N_t = |O(t)|, and C_s the binary decision to schedule section s:

\sum_{u \in O(t)} C_u \;\le\; N_t - N_t \sum_{s \in S(t)} C_s \qquad \forall\, t \in \text{time periods}
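The inequality can be sanity-checked numerically. Below is a minimal pure-Python sketch (the helper `overlap_ok` is hypothetical, not part of the model code) showing that once any section at time t is selected, the right-hand side collapses to zero and forbids every overlapping section:

```python
# Toy check of the overlap constraint: C values are 0/1 selection decisions.
def overlap_ok(C_overlapping, C_at_t):
    # N_t is the number of overlapping candidate sections
    N = len(C_overlapping)
    return sum(C_overlapping) <= N - N * sum(C_at_t)

print(overlap_ok([0, 0], [1]))  # True: section at t chosen, nothing overlapping
print(overlap_ok([1, 0], [1]))  # False: 1 <= 2 - 2*1 fails, overlap forbidden
print(overlap_ok([1, 1], [0]))  # True: nothing at t chosen, 2 <= 2 holds
```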
We apply five approaches

We want to use as much out-of-the-box functionality as possible.

1. Looping: iterate through lists and add variables to create the constraints.
2. Pandas: leverage the pandas package to do the looping with single commands and group the data together.
3. Multiprocessing: use the Python multiprocessing package to send multiple jobs to run in parallel.
4. Pandas, revisited: adjust how we sum values and observe the performance change.
5. Dask: use the Python dask library to compute sums from the data frames in parallel.

We randomly generate three potential sections (start times) for 10,000 different classes and then for 1,000 different classes. We build the constraint to ensure there are no overlaps. The timing does not include the time required to create the data structures.
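The data generation itself is not shown in the slides; the following is a hypothetical sketch of what "three potential sections per class" could look like (the names `NUM_CLASSES`, `SLOTS`, and the row layout are assumptions, not the authors' code):

```python
# Hypothetical test-data sketch: three random candidate start times per class.
import random

random.seed(0)
NUM_CLASSES = 1000
SLOTS = list(range(40))  # assumed: 40 possible start slots in a week

potential = [
    {'classID': c, 'timeID': t}
    for c in range(NUM_CLASSES)
    for t in random.sample(SLOTS, 3)  # three distinct candidate times
]
print(len(potential))  # 3000 rows: one per (class, candidate time)
```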
Looping

for t in potentialTimes.timeID.unique():
    # find the times that overlap with the given time
    lhsVars = []
    for idx, row in overlappingTimes[(overlappingTimes.timeID_x == t) &
                                     (overlappingTimes.timeID_y != t)].iterrows():
        for idx2, row2 in potentialTimes[potentialTimes.timeID == row.timeID_y].iterrows():
            lhsVars.append(row2.scheduleSection)
    rhsVars = []
    for idx3, row3 in potentialTimes[potentialTimes.timeID == t].iterrows():
        rhsVars.append(row3.scheduleSection)
    m.add_constraint(m.sum(lhsVars) <= len(lhsVars) - len(lhsVars) * m.sum(rhsVars),
                     'overlap_%s' % t)
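Most of the looping cost comes from repeatedly filtering the DataFrames and calling iterrows. Conceptually, the inner loops just bucket section variables by time; this pure-Python sketch (toy data, hypothetical names) shows the equivalent grouping that pandas' groupby later performs in a single call:

```python
# Bucketing section variable names by timeID with one dictionary pass.
from collections import defaultdict

rows = [('t1', 'x1'), ('t1', 'x2'), ('t2', 'x3')]  # (timeID, section var name)
by_time = defaultdict(list)
for time_id, var in rows:
    by_time[time_id].append(var)
print(dict(by_time))  # {'t1': ['x1', 'x2'], 't2': ['x3']}
```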
Looping results

10,000 classes: μ = 3.18, 17.40; σ = 0.11, 0.16; n = 10, 5
1,000 classes: μ = 1.00, 2.03; σ = 0.026, 0.09; n = 10, 5
Pandas

gb = potentialTimes.groupby(by='timeID').scheduleSection.sum()
overlaps = overlappingTimes[overlappingTimes.timeID_x != overlappingTimes.timeID_y] \
    .merge(potentialTimes, how='inner', left_on='timeID_y', right_on='timeID')
numOverlaps = overlaps.groupby(by='timeID_x').timeID_y.count()
overlappingSections = overlaps.groupby(by='timeID_x').scheduleSection.sum()
m.add_constraints([(overlappingSections[idx] <= numOverlaps[idx] - numOverlaps[idx] * sections,
                    'overlap_%s' % idx)
                   for idx, sections in gb.iteritems()])
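Here is a toy run of the same merge + groupby pattern on two tiny frames (column names taken from the slide, data invented), with plain strings standing in for docplex variables:

```python
# Minimal demo of the merge + groupby aggregation on invented data.
import pandas as pd

potentialTimes = pd.DataFrame({
    'timeID': ['t1', 't1', 't2'],
    'scheduleSection': ['x1!', 'x2!', 'x3!'],  # strings stand in for variables
})
overlappingTimes = pd.DataFrame({
    'timeID_x': ['t1', 't2'],
    'timeID_y': ['t2', 't1'],
})

overlaps = overlappingTimes[overlappingTimes.timeID_x != overlappingTimes.timeID_y].merge(
    potentialTimes, how='inner', left_on='timeID_y', right_on='timeID')
numOverlaps = overlaps.groupby(by='timeID_x').timeID_y.count()
grouped = overlaps.groupby(by='timeID_x').scheduleSection.sum()
print(numOverlaps.to_dict())  # {'t1': 1, 't2': 2}
print(grouped.to_dict())      # {'t1': 'x3!', 't2': 'x1!x2!'}
```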
Pandas results

10,000 classes: μ = 14.26, 15.02; σ = 0.20, 0.38; n = 10, 5
1,000 classes: μ = 0.27, 0.27; σ = 0.02; n = 5

The pandas coding is more compact, but requires getting used to creating merges and groupbys. Pandas is slowed by the merge, which in this case is close to a Cartesian multiplier.
Multiprocessing

# Create queues
task_queue = Queue()
done_queue = Queue()

# Start worker processes
for i in range(NUMBER_OF_PROCESSES):
    Process(target=worker, args=(task_queue, done_queue)).start()

NUM_TASKS = 0
task_queue.put((aggregate, (potentialTimes[['timeID', 'varNames']], 'timeID', 'varNames', 'gb')))
gb = 0
NUM_TASKS += 1
task_queue.put((merge, (overlappingTimes[overlappingTimes.timeID_x != overlappingTimes.timeID_y],
                        potentialTimes[['timeID', 'varNames']],
                        'inner', 'timeID_y', 'timeID', 'overlaps')))
overlaps = 0
NUM_TASKS += 1
overlappingSections = 0
numOverlaps = 0
pending = NUM_TASKS
while pending:  # consume results, including those of follow-up tasks
    result = done_queue.get()
    pending -= 1
    if result[0] == 'gb':
        gb = result[1].map(lambda x: m.sum(getVarList(m, x, '!')))
    if result[0] == 'overlaps':
        overlaps = result[1]
        task_queue.put((aggregate, (result[1], 'timeID_x', 'varNames', 'overlappingSections')))
        task_queue.put((getCount, (result[1], 'timeID_x', 'varNames', 'numOverlaps')))
        pending += 2  # two more results are now in flight
    if result[0] == 'overlappingSections':
        overlappingSections = result[1].map(lambda x: m.sum(getVarList(m, x, '!')))
    if result[0] == 'numOverlaps':
        numOverlaps = result[1]
m.add_constraints([(overlappingSections[idx] <= numOverlaps[idx] - numOverlaps[idx] * sections,
                    'overlap_%s' % idx)
                   for idx, sections in gb.iteritems()])
Multiprocessing results
10,000 classes: μ = 2.90, 2.79; σ = 0.23, 0.10; n = 5, 10
1,000 classes: μ = 1.70, 1.71; σ = 0.03, 0.04; n = 5, 10

Multiprocessing requires far more coding, but can leverage more of the processors on your system. Docplex variable objects cannot be pickled (pickling is how data is stored until a worker process can pick it up), so the variable names must be passed back and forth. As a result, packages such as dask will not work to aggregate docplex variables in a DataFrame. There is some overhead in creating the queues and starting the processes, so it is not always the fastest.
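The pickling restriction can be demonstrated without docplex: an object that cannot be pickled cannot pass through a multiprocessing queue, while its name string can. Here a lambda stands in for the unpicklable solver object:

```python
# Why variable *names* are passed between processes instead of variable objects.
import pickle

var = lambda: None  # stand-in for an unpicklable object
var_name = 'scheduleSection_t1'

try:
    pickle.dumps(var)
    picklable = True
except Exception:
    picklable = False

print(picklable)  # False: the object cannot cross a process boundary
print(pickle.loads(pickle.dumps(var_name)))  # the name round-trips fine
```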
Pandas with text "summing"

def getVarList(model, namesList, delimiter):
    return [model.get_var_by_name(name) for name in namesList.split(delimiter)[:-1]]

gb = potentialTimes.groupby(by='timeID').varNames.sum()
overlaps = overlappingTimes[overlappingTimes.timeID_x != overlappingTimes.timeID_y] \
    .merge(potentialTimes[['timeID', 'varNames']], how='inner',
           left_on='timeID_y', right_on='timeID')
numOverlaps = overlaps.groupby(by='timeID_x').timeID_y.count()
overlappingSections = overlaps.groupby(by='timeID_x').varNames.sum()
m.add_constraints([(m.sum(getVarList(m, overlappingSections[idx], '!')) <= numOverlaps[idx]
                    - numOverlaps[idx] * m.sum(getVarList(m, sections, '!')),
                    'overlap_%s' % idx)
                   for idx, sections in gb.iteritems()])
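The trick rests on each variable name being stored with a trailing delimiter, so pandas' string "sum" (concatenation) yields one splittable token stream per group. A stripped-down version of the name handling (model lookup omitted, helper name hypothetical):

```python
# Names carry a trailing delimiter; split(...)[:-1] drops the empty string
# left after the final delimiter.
def getVarListNames(namesList, delimiter):
    return namesList.split(delimiter)[:-1]

summed = 'x1!' + 'x2!' + 'x3!'  # what a pandas string sum produces per group
print(getVarListNames(summed, '!'))  # ['x1', 'x2', 'x3']
```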
Pandas with text "summing" results

10,000 classes: μ = 1.18; σ = 0.07; n = 10
1,000 classes: μ = 0.091; σ = 0.003; n = 10

The pandas coding is more compact, but requires getting used to creating merges and groupbys. Pandas is slowed by the merge, which in this case is close to a Cartesian multiplier. Summation of string values is much faster in pandas, and translation from a name to a variable object is fast in Python.
Dask with text "summing"

pt = dd.from_pandas(potentialTimes, npartitions=8)
ot = dd.from_pandas(overlappingTimes, npartitions=8)
gb = pt.groupby(by='timeID').varNames.sum()
overlaps = ot[ot.timeID_x != ot.timeID_y] \
    .merge(pt[['timeID', 'varNames']], how='inner',
           left_on='timeID_y', right_on='timeID')
numOverlaps = overlaps.groupby(by='timeID_x').timeID_y.count()
numOverlaps = numOverlaps.compute()
overlappingSections = overlaps.groupby(by='timeID_x').varNames.sum()
overlappingSections = overlappingSections.compute()
m.add_constraints([(m.sum(getVarList(m, overlappingSections[idx], '!')) <= numOverlaps[idx]
                    - numOverlaps[idx] * m.sum(getVarList(m, sections, '!')),
                    'overlap_%s' % idx)
                   for idx, sections in gb.compute().iteritems()])

def getVarList(model, namesList, delimiter):
    return [model.get_var_by_name(name) for name in namesList.split(delimiter)[:-1]]
Dask with text "summing" results

10,000 classes: μ = 1.28, 1.30, 1.75; σ = 0.059, 0.039, 0.101; n = 10
1,000 classes: μ = 0.29, 0.44, 0.76; σ = 0.059, , ; n = 10

The dask coding requires some conversion back to pandas for some of the constraint generation, which slows down the implementation. The number of partitions affects performance; results represent two, four, and eight partitions.
Conclusion

Docplex supports many ways to build model constraints. It may be easiest to start with an intuitive method and migrate to a fast method. Leverage Python packages such as pandas. Dask may hold promise for very large problems.